Preface


My motivation
for writing the first edition of Introductory Econometrics: A Modern Approach was that I saw a fairly 
wide gap between how econometrics is taught to undergraduates and how empirical researchers think 
about and apply econometric methods. I became convinced that teaching introductory econometrics 
from the perspective of professional users of econometrics would actually simplify the presentation, 
in addition to making the subject much more interesting. 

Based on the positive reactions to the several earlier editions, it appears that my hunch was correct. 
Many instructors, having a variety of backgrounds and interests and teaching students with different 
levels of preparation, have embraced the modern approach to econometrics espoused in this text. The 
emphasis in this edition is still on applying econometrics to real-world problems. Each econometric 
method is motivated by a particular issue facing researchers analyzing nonexperimental data. The focus 
in the main text is on understanding and interpreting the assumptions in light of actual empirical appli- 
cations: the mathematics required is no more than college algebra and basic probability and statistics. 


Designed for Today’s Econometrics Course 




The seventh edition preserves the overall organization of the sixth. The most noticeable feature 
that distinguishes this text from most others is the separation of topics by the kind of data being ana- 
lyzed. This is a clear departure from the traditional approach, which presents a linear model, lists all 
assumptions that may be needed at some future point in the analysis, and then proves or asserts results 
without clearly connecting them to the assumptions. My approach is first to treat, in Part 1, mul- 
tiple regression analysis with cross-sectional data, under the assumption of random sampling. This 
setting is natural to students because they are familiar with random sampling from a population in 
their introductory statistics courses. Importantly, it allows us to distinguish assumptions made about 
the underlying population regression model—assumptions that can be given economic or behavioral 
content—from assumptions about how the data were sampled. Discussions about the consequences of 
nonrandom sampling can be treated in an intuitive fashion after the students have a good grasp of the 
multiple regression model estimated using random samples. 

An important feature of a modern approach is that the explanatory variables—along with the 
dependent variable—are treated as outcomes of random variables. For the social sciences, allow- 
ing random explanatory variables is much more realistic than the traditional assumption of nonran- 
dom explanatory variables. As a nontrivial benefit, the population model/random sampling approach 
reduces the number of assumptions that students must absorb and understand. Ironically, the classical 
approach to regression analysis, which treats the explanatory variables as fixed in repeated samples 
and is still pervasive in introductory texts, literally applies to data collected in an experimental setting. 
In addition, the contortions required to state and explain assumptions can be confusing to students. 

My focus on the population model emphasizes that the fundamental assumptions underlying 
regression analysis, such as the zero mean assumption on the unobservable error term, are properly 
stated conditional on the explanatory variables. This leads to a clear understanding of the kinds of
problems, such as heteroskedasticity (nonconstant variance), that can invalidate standard inference 
procedures. By focusing on the population, I am also able to dispel several misconceptions that arise 
in econometrics texts at all levels. For example, I explain why the usual R-squared is still valid as a 
goodness-of-fit measure in the presence of heteroskedasticity (Chapter 8) or serially correlated errors 
(Chapter 12); I provide a simple demonstration that tests for functional form should not be viewed 
as general tests of omitted variables (Chapter 9); and I explain why one should always include in a 
regression model extra control variables that are uncorrelated with the explanatory variable of inter- 
est, which is often a key policy variable (Chapter 6). 

Because the assumptions for cross-sectional analysis are relatively straightforward yet realis- 
tic, students can get involved early with serious cross-sectional applications without having to worry 
about the thorny issues of trends, seasonality, serial correlation, high persistence, and spurious regres- 
sion that are ubiquitous in time series regression models. Initially, I figured that my treatment of 
regression with cross-sectional data followed by regression with time series data would find favor 
with instructors whose own research interests are in applied microeconomics, and that appears to be 
the case. It has been gratifying that adopters of the text with an applied time series bent have been 
equally enthusiastic about the structure of the text. By postponing the econometric analysis of time 
series data, I am able to put proper focus on the potential pitfalls in analyzing time series data that do 
not arise with cross-sectional data. In effect, time series econometrics finally gets the serious treat- 
ment it deserves in an introductory text. 

As in the earlier editions, I have consciously chosen topics that are important for reading journal 
articles and for conducting basic empirical research. Within each topic, I have deliberately omitted 
many tests and estimation procedures that, while traditionally included in textbooks, have not with- 
stood the empirical test of time. Likewise, I have emphasized more recent topics that have clearly 
demonstrated their usefulness, such as obtaining test statistics that are robust to heteroskedasticity 
(or serial correlation) of unknown form, using multiple years of data for policy analysis, or solving 
the omitted variable problem by instrumental variables methods. I appear to have made fairly good 
choices, as I have received only a handful of suggestions for adding or deleting material. 

I take a systematic approach throughout the text, by which I mean that each topic is presented by 
building on the previous material in a logical fashion, and assumptions are introduced only as they 
are needed to obtain a conclusion. For example, empirical researchers who use econometrics in their 
research understand that not all of the Gauss-Markov assumptions are needed to show that the ordi- 
nary least squares (OLS) estimators are unbiased. Yet the vast majority of econometrics texts intro- 
duce a complete set of assumptions (many of which are redundant or in some cases even logically 
conflicting) before proving the unbiasedness of OLS. Similarly, the normality assumption is often 
included among the assumptions that are needed for the Gauss-Markov Theorem, even though it is 
fairly well known that normality plays no role in showing that the OLS estimators are the best linear 
unbiased estimators. 

My systematic approach is illustrated by the order of assumptions that I use for multiple regres- 
sion in Part 1. This structure results in a natural progression for briefly summarizing the role of each 
assumption: 


MLR.1: Introduce the population model and interpret the population parameters (which we hope 
to estimate). 

MLR.2: Introduce random sampling from the population and describe the data that we use to 
estimate the population parameters. 

MLR.3: Add the assumption on the explanatory variables that allows us to compute the estimates 
from our sample; this is the so-called no perfect collinearity assumption. 

MLR.4: Assume that, in the population, the mean of the unobservable error does not depend on the 
values of the explanatory variables; this is the “mean independence” assumption combined with a 
zero population mean for the error, and it is the key assumption that delivers unbiasedness of OLS. 

After introducing Assumptions MLR.1 to MLR.3, one can discuss the algebraic properties of ordinary
least squares, that is, the properties of OLS for a particular set of data. By adding Assumption
MLR.4, we can show that OLS is unbiased (and consistent). Assumption MLR.5 (homoskedastic- 
ity) is added for the Gauss-Markov Theorem and for the usual OLS variance formulas to be valid. 
Assumption MLR.6 (normality), which is not introduced until Chapter 4, is added to round out the 
classical linear model assumptions. The six assumptions are used to obtain exact statistical inference 
and to conclude that the OLS estimators have the smallest variances among all unbiased estimators. 
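
For readers who prefer symbols, the assumptions that carry distributional content can be sketched as follows. The notation here ($y$ for the dependent variable, $x_1, \ldots, x_k$ for the explanatory variables, $u$ for the unobservable error, $\sigma^2$ for the error variance) does not appear in this preface and is assumed purely for illustration.

```latex
\begin{align*}
\text{MLR.1 (population model):}\quad
  & y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + u \\
\text{MLR.4 (zero conditional mean):}\quad
  & \mathrm{E}(u \mid x_1, \ldots, x_k) = 0 \\
\text{MLR.5 (homoskedasticity):}\quad
  & \mathrm{Var}(u \mid x_1, \ldots, x_k) = \sigma^2 \\
\text{MLR.6 (normality):}\quad
  & u \text{ is independent of } (x_1, \ldots, x_k),\quad
    u \sim \mathrm{Normal}(0, \sigma^2) \\[4pt]
\text{Under MLR.1 to MLR.4:}\quad
  & \mathrm{E}(\hat{\beta}_j) = \beta_j, \qquad j = 0, 1, \ldots, k
\end{align*}
```

Written this way, MLR.4, MLR.5, and MLR.6 are each statements about the error conditional on, or in relation to, the explanatory variables, which is the sense in which the assumptions are stated conditional on the explanatory variables.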

I use parallel approaches when I turn to the study of large-sample properties and when I treat 
regression for time series data in Part 2. The careful presentation and discussion of assumptions 
makes it relatively easy to transition to Part 3, which covers advanced topics that include using pooled 
cross-sectional data, exploiting panel data structures, and applying instrumental variables methods. 
Generally, I have strived to provide a unified view of econometrics, where all estimators and test sta- 
tistics are obtained using just a few intuitively reasonable principles of estimation and testing (which, 
of course, also have rigorous justification). For example, regression-based tests for heteroskedasticity 
and serial correlation are easy for students to grasp because they already have a solid understanding 
of regression. This is in contrast to treatments that give a set of disjointed recipes for outdated econo- 
metric testing procedures. 

Throughout the text, I emphasize ceteris paribus relationships, which is why, after one chapter on 
the simple regression model, I move to multiple regression analysis. The multiple regression setting 
motivates students to think about serious applications early. I also give prominence to policy analysis 
with all kinds of data structures. Practical topics, such as using proxy variables to obtain ceteris pari- 
bus effects and interpreting partial effects in models with interaction terms, are covered in a simple 
fashion. 


Designed for Undergraduates, Applicable to Master’s Students


The text is designed for undergraduate economics majors who have taken college algebra and 
one semester of introductory probability and statistics. (Math Refreshers A, B, and C contain the
requisite background material.) A one-semester or one-quarter econometrics course would not be 
expected to cover all, or even any, of the more advanced material in Part 3. A typical introduc- 
tory course includes Chapters 1 through 8, which cover the basics of simple and multiple regres- 
sion for cross-sectional data. Provided the emphasis is on intuition and interpreting the empirical 
examples, the material from the first eight chapters should be accessible to undergraduates in most 
economics departments. Most instructors will also want to cover at least parts of the chapters 
on regression analysis with time series data, Chapters 10 and 12, in varying degrees of depth. 
In the one-semester course that I teach at Michigan State, I cover Chapter 10 fairly carefully, 
give an overview of the material in Chapter 11, and cover the material on serial correlation in 
Chapter 12. I find that this basic one-semester course puts students on a solid footing to write 
empirical papers, such as a term paper, a senior seminar paper, or a senior thesis. Chapter 9 
contains more specialized topics that arise in analyzing cross-sectional data, including data 
problems such as outliers and nonrandom sampling; for a one-semester course, it can be skipped 
without loss of continuity. 

The structure of the text makes it ideal for a course with a cross-sectional or policy analysis 
focus: the time series chapters can be skipped in favor of topics from Chapter 9 or 15. The new material
on potential outcomes added to the first nine chapters should help the instructor craft a course
that provides an introduction to modern policy analysis. Chapter 13 is advanced only in the sense 
that it treats two new data structures: independently pooled cross sections and two-period panel data 
analysis. Such data structures are especially useful for policy analysis, and the chapter provides 
several examples. Students with a good grasp of Chapters 1 through 8 will have little difficulty with
Chapter 13. Chapter 14 covers more advanced panel data methods and would probably be covered 
only in a second course. A good way to end a course on cross-sectional methods is to cover the rudi- 
ments of instrumental variables estimation in Chapter 15. 

I have used selected material in Part 3, including Chapters 13 and 17, in a senior seminar geared 
to producing a serious research paper. Along with the basic one-semester course, students who have 
been exposed to basic panel data analysis, instrumental variables estimation, and limited dependent 
variable models are in a position to read large segments of the applied social sciences literature. 
Chapter 17 provides an introduction to the most common limited dependent variable models. 

The text is also well suited for an introductory master’s level course, where the emphasis is on 
applications rather than on derivations using matrix algebra. Several instructors have used the text to 
teach policy analysis at the master’s level. For instructors wanting to present the material in matrix 
form, Appendices D and E are self-contained treatments of the matrix algebra and the multiple regres- 
sion model in matrix form. 

At Michigan State, PhD students in many fields that require data analysis—including accounting, 
agricultural economics, development economics, economics of education, finance, international eco- 
nomics, labor economics, macroeconomics, political science, and public finance—have found the text 
to be a useful bridge between the empirical work that they read and the more theoretical econometrics 
they learn at the PhD level. 


Suggestions for Designing Your Course Beyond the Basics


I have already commented on the contents of most of the chapters as well as possible outlines for 
courses. Here I provide more specific comments about material in chapters that might be covered or 
skipped: 

Chapter 9 has some interesting examples (such as a wage regression that includes IQ score as 
an explanatory variable). The rubric of proxy variables does not have to be formally introduced to 
present these kinds of examples, and I typically do so when finishing up cross-sectional analysis. In 
Chapter 12, for a one-semester course, I skip the material on serial correlation robust inference for 
ordinary least squares as well as dynamic models of heteroskedasticity. 

Even in a second course I tend to spend only a little time on Chapter 16, which covers simultane- 
ous equations analysis. I have found that instructors differ widely in their opinions on the importance 
of teaching simultaneous equations models to undergraduates. Some think this material is funda- 
mental; others think it is rarely applicable. My own view is that simultaneous equations models are 
overused (see Chapter 16 for a discussion). If one reads applications carefully, omitted variables and 
measurement error are much more likely to be the reason one adopts instrumental variables estimation, 
and this is why I use omitted variables to motivate instrumental variables estimation in Chapter 15. 
Still, simultaneous equations models are indispensable for estimating demand and supply functions, 
and they apply in some other important cases as well. 

Chapter 17 is the only chapter that considers models inherently nonlinear in their parameters, 
and this puts an extra burden on the student. The first material one should cover in this chapter is on 
probit and logit models for binary response. My presentation of Tobit models and censored regression 
still appears to be novel in introductory texts. I explicitly recognize that the Tobit model is applied to 
corner solution outcomes on random samples, while censored regression is applied when the data col- 
lection process censors the dependent variable at essentially arbitrary thresholds. 

Chapter 18 covers some recent important topics from time series econometrics, including test- 
ing for unit roots and cointegration. I cover this material only in a second-semester course at either 
the undergraduate or master’s level. A fairly detailed introduction to forecasting is also included in 
Chapter 18. 

Chapter 19, which would be added to the syllabus for a course that requires a term paper, is much
more extensive than similar chapters in other texts. It summarizes some of the methods appropriate 
for various kinds of problems and data structures, points out potential pitfalls, explains in some detail 
how to write a term paper in empirical economics, and includes suggestions for possible projects. 


What’s Changed? 


I have added new exercises to many chapters, including to the Math Refresher and Advanced 
Treatment appendices. Some of the new computer exercises use new data sets, including a data set 
on performance of men’s college basketball teams. I have also added more challenging problems that 
require derivations. 

There are several notable changes to the text. An important organizational change, which should 
facilitate a wider variety of teaching tastes, is that the notion of binary, or dummy, explanatory vari- 
ables is introduced in Chapter 2. There, it is shown that ordinary least squares estimation leads to a 
staple in basic statistics: the difference in means between two subgroups in a population. By introduc- 
ing qualitative factors into regression early on, the instructor is able to use a wider variety of empirical 
examples from the very beginning. 
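
As a minimal sketch of this point (not taken from the text), the following Python snippet checks numerically that the OLS slope on a binary explanatory variable equals the difference in sample means between the two subgroups, and that the intercept equals the mean of the base group. The data are simulated, and the helper name `simple_ols`, the group sizes, and the parameter values are all hypothetical choices made for illustration.

```python
import random


def simple_ols(x, y):
    """Return (intercept, slope) from the simple OLS regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sample covariance and variance, up to the common 1/(n - 1) factor
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


random.seed(0)

# Hypothetical data: d = 1 for program participants, d = 0 otherwise
d = [1] * 80 + [0] * 120
y = [10.0 + 2.0 * di + random.gauss(0.0, 3.0) for di in d]

intercept, slope = simple_ols(d, y)

# Subgroup sample means computed directly
mean_treated = sum(yi for yi, di in zip(y, d) if di == 1) / d.count(1)
mean_control = sum(yi for yi, di in zip(y, d) if di == 0) / d.count(0)

print(f"OLS slope:           {slope:.6f}")
print(f"Difference in means: {mean_treated - mean_control:.6f}")
print(f"OLS intercept:       {intercept:.6f}")
print(f"Control-group mean:  {mean_control:.6f}")
```

The two pairs of printed numbers agree up to floating-point rounding, whatever seed or group sizes are used, because the equality is an algebraic property of OLS rather than a feature of this particular sample.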

The early discussion of binary explanatory variables allows for a formal introduction of potential, 
or counterfactual, outcomes, which is indispensable in the modern literature on estimating causal 
effects. The counterfactual approach to studying causality appeared in previous editions, but Chapters 2,
3, 4, and 7 now explicitly include new sections on the modern approach to causal inference. Because 
basic policy analysis involves the binary decision to participate in a program or not, a leading example 
of using dummy independent variables in simple and multiple regression is to evaluate policy inter- 
ventions. At the same time, the new material is incorporated into the text so that instructors not wish- 
ing to cover the potential outcomes framework may easily skip the material. Several end-of-chapter 
problems concern extensions of the basic potential outcomes framework, which should be valuable 
for instructors wishing to cover that material. 

Chapter 3 includes a new section on different ways that one can apply multiple regression, 
including problems of pure prediction, testing efficient markets, and culminating with a discussion of 
estimating treatment or causal effects. I think this section provides a nice way to organize students’ 
thinking about the scope of multiple regression after they have seen the mechanics of ordinary least 
squares (OLS) and several examples. As with other new material that touches on causal effects, this
material can be skipped without loss of continuity. A new section in Chapter 7 continues the discussion 
of potential outcomes, allowing for nonconstant treatment effects. The material is a nice illustration 
of estimating different regression functions for two subgroups from a population. New problems 
in this chapter give the student more experience in using full regression adjustment to estimate
causal effects. 

One notable change to Chapter 9 is a more detailed discussion of using missing data indicators 
when data are missing on one or more of the explanatory variables. The assumptions underlying the 
method are discussed in more detail than in the previous edition. 

Chapter 12 has been reorganized to reflect a more modern treatment of the problem of serial 
correlation in the errors of time series regression models. The new structure first covers adjusting the 
OLS standard errors to allow general forms of serial correlation. Thus, the chapter outline now paral- 
lels that in Chapter 8, with the emphasis in both cases on OLS estimation but making inference robust 
to violation of standard assumptions. Correcting for serial correlation using generalized least squares 
now comes after OLS and the treatment of testing for serial correlation. 

The advanced chapters also include several improvements. Chapter 13 now discusses, at an acces- 
sible level, extensions of the standard difference-in-differences setup, allowing for multiple control 
groups, multiple time periods, and even group-specific trends. In addition, the chapter includes a
more detailed discussion of computing standard errors robust to serial correlation when using first- 
differencing estimation with panel data. 

Chapter 14 now provides more detailed discussions of several important issues in estimating 
panel data models by fixed effects, random effects, and correlated random effects (CRE). The CRE 
approach with missing data is discussed in more detail, as is how one accounts for general functional 
forms, such as squares and interactions, which are covered in the cross-sectional setting in Chapter 6. 
An expanded section on general policy analysis with panel data should be useful for courses with an 
emphasis on program interventions and policy evaluation. 

Chapter 16, which still covers simultaneous equations models, now provides an explicit link 
between the potential outcomes framework and specification of simultaneous equations models. 

Chapter 17 now includes a discussion of using regression adjustment for estimating causal (treat- 
ment) effects when the outcome variable has special features, such as when the outcome itself is a 
binary variable. Then, as the reader is asked to explore in a new problem, logit and probit models can 
be used to obtain more reliable estimates of average treatment effects by estimating separate models 
for each treatment group. 

Chapter 18 now provides more details about how one can compute a proper standard error for 
a forecast (as opposed to a prediction) interval. This should help the advanced reader understand in 
more detail the nature of the uncertainty in the forecast. 


About MindTap™ 


MindTap is an outcome-driven application that propels students from memorization to mastery. 
It’s the only platform that gives you complete ownership of your course. With it, you can challenge 
every student, build their confidence, and empower them to be unstoppable. 

Access Everything You Need In One Place. Cut down on prep with preloaded, organized course 
materials in MindTap. Teach more efficiently with interactive multimedia, assignments, quizzes and 
more. And give your students the power to read, listen and study on their phones, so they can learn on 
their terms. 

Empower Your Students To Reach Their Potential. Twelve distinct metrics give you actionable 
insights into student engagement. Identify topics troubling your entire class and instantly communi- 
cate with struggling students. And students can track their scores to stay motivated toward their goals. 
Together, you can accelerate progress. 

Your Course. Your Content. Only MindTap gives you complete control over your course. You 
have the flexibility to reorder textbook chapters, add your own notes and embed a variety of content 
including OER. Personalize course content to your students’ needs. They can even read your notes, 
add their own and highlight key text to aid their progress. 

A Dedicated Team, Whenever You Need Them. MindTap isn’t just a tool; it’s backed by a per- 
sonalized team eager to support you. Get help setting up your course and tailoring it to your specific 
objectives. You’ll be ready to make an impact from day one. And we’ll be right here to help you and
your students throughout the semester—and beyond. 


Design Features 


In addition to the didactic material in the chapter, I have included two features to help students 
better understand and apply what they are learning. Each chapter contains many numbered examples. 
Several of these are case studies drawn from recently published papers. I have used my judgment to 
simplify the analysis, hopefully without sacrificing the main point. The “Going Further Questions” in 
the chapter provide students an opportunity to “go further” in learning the material through analysis
or application. Students will find immediate feedback for these questions at the end of the text.

The end-of-chapter problems and computer exercises are heavily oriented toward empirical work, 
rather than complicated derivations. The students are asked to reason carefully based on what they 
have learned. The computer exercises often expand on the in-text examples. Several exercises use data 
sets from published works or similar data sets that are motivated by published research in economics 
and other fields. 

A pioneering feature of this introductory econometrics text is the extensive glossary. The short 
definitions and descriptions are a helpful refresher for students studying for exams or reading empiri- 
cal research that uses econometric methods. I have added and updated several entries for the seventh 
edition. 


Instructional Tools 


Cengage offers various supplements for instructors and students who use this book. I would like 
to thank the Subject Matter Expert team who worked on these supplements and made teaching and 
learning easy. 


C. Patrick Scott, Ph.D., Louisiana Tech University (R Videos and Computer exercise reviewer) 
Hisham Foad (Aplia Homework reviewer and Glossary)

Kenneth H. Brown, Missouri State University (R Videos creator) 

Scott Kostyshak, University of Florida (R Videos reviewer) 

Ujwal Kharel (Test Bank and Adaptive Test Prep) 


Data Sets — Available in Six Formats 


With more than 100 data sets in six different formats, including Stata®, R, EViews®, Minitab®, 
Microsoft® Excel, and Text, the instructor has many options for problem sets, examples, and term 
projects. Because most of the data sets come from actual research, some are very large. Except for 
partial lists of data sets to illustrate the various data structures, the data sets are not reported in the 
text. This book is geared to a course where computer work plays an integral role. 


Updated Data Sets Handbook 


An extensive data description manual is also available online. This manual contains a list of data 
sources along with suggestions for ways to use the data sets that are not described in the text. This 
unique handbook, created by author Jeffrey M. Wooldridge, lists the source of all data sets for quick 
reference and how each might be used. Because the data book contains page numbers, it is easy to 
see how the author used the data in the text. Students may want to view the descriptions of each data 
set, and the handbook can help guide instructors in generating new homework exercises, exam problems, or term
projects. The author also provides suggestions on improving the data sets in this detailed resource that 
is available on the book’s companion website at http://login.cengage.com and students can access it 
free at www.cengage.com. 


Instructor’s Manual with Solutions


REVISED INSTRUCTOR’S MANUAL WITH SOLUTIONS SAVES TIME IN PREPARATION 
AND GRADING. The online Instructor’s Manual with solutions contains answers to all exercises in 
this edition. Teaching tips provide suggestions for presenting each chapter’s material. The Instructor’s 
Manual also contains sources for each of the data files with suggestions for using the data to develop 
problem sets, exams, and term papers. The Instructor’s Manual is password-protected and available 
for download on the book’s companion website. 


Test Bank 


Cengage Testing, powered by Cognero®, is a flexible, online system that allows you to import,
edit, and manipulate content from the text’s test bank or elsewhere, including your own favorite test 
questions; create multiple test versions in an instant; and deliver tests from your LMS, your class- 
room, or wherever you want. 


PowerPoint Slides 


UPDATED POWERPOINT® SLIDES BRING LECTURES TO LIFE WHILE VISUALLY 
CLARIFYING CONCEPTS. Exceptional PowerPoint® presentation slides, created specifically for 
this edition, help you create engaging, memorable lectures. The slides are particularly useful for clari- 
fying advanced topics in Part 3. You can modify or customize the slides for your specific course. 
PowerPoint® slides are available for convenient download on the instructor-only, password-protected 
section of the book’s companion website. 


Scientific Word Slides 


UPDATED SCIENTIFIC WORD® SLIDES REINFORCE TEXT CONCEPTS AND LECTURE 
PRESENTATIONS. Created by the text author, this edition’s Scientific Word® slides reinforce the 
book’s presentation slides while highlighting the benefits of Scientific Word®, the application created
by MacKichan Software, Inc. specifically for composing mathematical, scientific, and technical
documents using LaTeX typesetting. These slides are based on the author’s actual lectures and
are available for convenient download on the password-protected section of the book’s companion 
website. 


Student Supplements 


Student Solutions Manual 


Now your students can maximize their study time and further their course success with this dynamic
online resource. This helpful Solutions Manual includes detailed steps and solutions to odd-numbered 
problems as well as computer exercises in the text. This supplement is available as a free resource at 
www.cengagebrain.com. 

Some of the changes I discussed earlier were driven by comments I received from people on this
list, and I continue to mull over other specific suggestions made by one or more reviewers. 

Many students and teaching assistants, too numerous to list, have caught mistakes in earlier 
editions or have suggested rewording some paragraphs. I am grateful to them. 

As always, it was a pleasure working with the team at Cengage Learning. Michael Parthenakis, 
my longtime Product Manager, has learned very well how to guide me with a firm yet gentle hand. 
Anita Verma and Ethan Crist quickly mastered the difficult challenges of being the content and sub- 
ject matter expert team of a dense, technical textbook. Their careful reading of the manuscript and 
fine eye for detail have improved this seventh edition considerably. 

This book is dedicated to my family: Leslie, Edmund, and R.G. 


The Nature of Econometrics and Economic Data


Chapter 1 discusses the scope of econometrics and raises general issues that arise in the
application of econometric methods. Section 1-1 provides a brief discussion about the pur- 
pose and scope of econometrics and how it fits into economic analysis. Section 1-2 provides 
examples of how one can start with an economic theory and build a model that can be estimated 
using data. Section 1-3 examines the kinds of data sets that are used in business, economics, and 
other social sciences. Section 1-4 provides an intuitive discussion of the difficulties associated with
inferring causality in the social sciences.


1-1 What Is Econometrics? 


Imagine that you are hired by your state government to evaluate the effectiveness of a publicly 
funded job training program. Suppose this program teaches workers various ways to use computers in 
the manufacturing process. The 20-week program offers courses during nonworking hours. Any 
hourly manufacturing worker may participate, and enrollment in all or part of the program is volun- 
tary. You are to determine what, if any, effect the training program has on each worker’s subsequent 
hourly wage. 

Now, suppose you work for an investment bank. You are to study the returns on different invest- 
ment strategies involving short-term U.S. treasury bills to decide whether they comply with implied 
economic theories. 

The task of answering such questions may seem daunting at first. At this point, you may only 
have a vague idea of the kind of data you would need to collect. By the end of this introductory 
econometrics course, you should know how to use econometric methods to formally evaluate a job 
training program or to test a simple economic theory. 


Econometrics is based upon the development of statistical methods for estimating economic 
relationships, testing economic theories, and evaluating and implementing government and business 
policy. A common application of econometrics is the forecasting of such important macroeconomic 
variables as interest rates, inflation rates, and gross domestic product (GDP). Whereas forecasts of 
economic indicators are highly visible and often widely published, econometric methods can be used 
in economic areas that have nothing to do with macroeconomic forecasting. For example, we will 
study the effects of political campaign expenditures on voting outcomes. We will consider the effect 
of school spending on student performance in the field of education. In addition, we will learn how to 
use econometric methods for forecasting economic time series. 

Econometrics has evolved as a separate discipline from mathematical statistics because the for- 
mer focuses on the problems inherent in collecting and analyzing nonexperimental economic data. 
Nonexperimental data are not accumulated through controlled experiments on individuals, firms, 
or segments of the economy. (Nonexperimental data are sometimes called observational data, or 
retrospective data, to emphasize the fact that the researcher is a passive collector of the data.) 
Experimental data are often collected in laboratory environments in the natural sciences, but 
they are more difficult to obtain in the social sciences. Although some social experiments can be 
devised, it is often impossible, prohibitively expensive, or morally repugnant to conduct the kinds 
of controlled experiments that would be needed to address economic issues. We give some specific 
examples of the differences between experimental and nonexperimental data in Section 1-4. 

Naturally, econometricians have borrowed from mathematical statisticians whenever possible. 
The method of multiple regression analysis is the mainstay in both fields, but its focus and interpreta- 
tion can differ markedly. In addition, economists have devised new techniques to deal with the com- 
plexities of economic data and to test the predictions of economic theories. 


1-2 Steps in Empirical Economic Analysis 


Econometric methods are relevant in virtually every branch of applied economics. They come into 
play either when we have an economic theory to test or when we have a relationship in mind that has 
some importance for business decisions or policy analysis. An empirical analysis uses data to test a 
theory or to estimate a relationship. 

How does one go about structuring an empirical economic analysis? It may seem obvious, but 
it is worth emphasizing that the first step in any empirical analysis is the careful formulation of the 
question of interest. The question might deal with testing a certain aspect of an economic theory, or it 
might pertain to testing the effects of a government policy. In principle, econometric methods can be 
used to answer a wide range of questions. 

In some cases, especially those that involve the testing of economic theories, a formal economic 
model is constructed. An economic model consists of mathematical equations that describe various 
relationships. Economists are well known for their building of models to describe a vast array of 
behaviors. For example, in intermediate microeconomics, individual consumption decisions, subject 
to a budget constraint, are described by mathematical models. The basic premise underlying these 
models is utility maximization. The assumption that individuals make choices to maximize their 
well-being, subject to resource constraints, gives us a very powerful framework for creating tractable 
economic models and making clear predictions. In the context of consumption decisions, utility maxi- 
mization leads to a set of demand equations. In a demand equation, the quantity demanded of each 
commodity depends on the price of the goods, the price of substitute and complementary goods, the 
consumer’s income, and the individual’s characteristics that affect taste. These equations can form the 
basis of an econometric analysis of consumer demand. 

Economists have used basic economic tools, such as the utility maximization framework, to 
explain behaviors that at first glance may appear to be noneconomic in nature. A classic example is 
Becker’s (1968) economic model of criminal behavior. 
Example 1.1 Economic Model of Crime 


In a seminal article, Nobel Prize winner Gary Becker postulated a utility maximization framework to 
describe an individual’s participation in crime. Certain crimes have clear economic rewards, but most 
criminal behaviors have costs. The opportunity costs of crime prevent the criminal from participating 
in other activities such as legal employment. In addition, there are costs associated with the possibility 
of being caught and then, if convicted, the costs associated with incarceration. From Becker’s per- 
spective, the decision to undertake illegal activity is one of resource allocation, with the benefits and 
costs of competing activities taken into account. 

Under general assumptions, we can derive an equation describing the amount of time spent in 
criminal activity as a function of various factors. We might represent such a function as 


$y = f(x_1, x_2, x_3, x_4, x_5, x_6, x_7)$, [1.1] 
where 

$y$ = hours spent in criminal activities, 
$x_1$ = “wage” for an hour spent in criminal activity, 
$x_2$ = hourly wage in legal employment, 
$x_3$ = income other than from crime or employment, 
$x_4$ = probability of getting caught, 
$x_5$ = probability of being convicted if caught, 
$x_6$ = expected sentence if convicted, and 
$x_7$ = age. 


Other factors generally affect a person’s decision to participate in crime, but the list above is rep- 
resentative of what might result from a formal economic analysis. As is common in economic theory, 
we have not been specific about the function $f(\cdot)$ in (1.1). This function depends on an underlying 
utility function, which is rarely known. Nevertheless, we can use economic theory, or introspection, to 
predict the effect that each variable would have on criminal activity. This is the basis for an econometric 
analysis of individual criminal activity. 


Formal economic modeling is sometimes the starting point for empirical analysis, but it is more com- 
mon to use economic theory less formally, or even to rely entirely on intuition. You may agree that the deter- 
minants of criminal behavior appearing in equation (1.1) are reasonable based on common sense; we might 
arrive at such an equation directly, without starting from utility maximization. This view has some merit, 
although there are cases in which formal derivations provide insights that intuition can overlook. 

Next is an example of an equation that we can derive through somewhat informal reasoning. 


Example 1.2 Job Training and Worker Productivity 


Consider the problem posed at the beginning of Section 1-1. A labor economist would like to examine 
the effects of job training on worker productivity. In this case, there is little need for formal economic 
theory. Basic economic understanding is sufficient for realizing that factors such as education, experi- 
ence, and training affect worker productivity. Also, economists are well aware that workers are paid 
commensurate with their productivity. This simple reasoning leads to a model such as 


wage = f(educ, exper, training), [1.2] 
where 
wage = hourly wage, 
educ = years of formal education, 
exper = years of workforce experience, and 


training = weeks spent in job training. 


Again, other factors generally affect the wage rate, but equation (1.2) captures the essence of the problem. 
After we specify an economic model, we need to turn it into what we call an econometric model. 
Because we will deal with econometric models throughout this text, it is important to know how an 
econometric model relates to an economic model. Take equation (1.1) as an example. The form of the 
function $f(\cdot)$ must be specified before we can undertake an econometric analysis. A second issue 
concerning (1.1) is how to deal with variables that cannot reasonably be observed. For example, consider 
the wage that a person can earn in criminal activity. In principle, such a quantity is well defined, but it 
would be difficult if not impossible to observe this wage for a given individual. Even variables such as 
the probability of being arrested cannot realistically be obtained for a given individual, but at least we 
can observe relevant arrest statistics and derive a variable that approximates the probability of arrest. 
Many other factors affect criminal behavior that we cannot even list, let alone observe, but we must 
somehow account for them. 

The ambiguities inherent in the economic model of crime are resolved by specifying a particular 
econometric model: 


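$\mathit{crime} = \beta_0 + \beta_1 \mathit{wage} + \beta_2 \mathit{othinc} + \beta_3 \mathit{freqarr} + \beta_4 \mathit{freqconv} + \beta_5 \mathit{avgsen} + \beta_6 \mathit{age} + u$, [1.3] 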
where 
crime = some measure of the frequency of criminal activity, 
wage = the wage that can be earned in legal employment, 
othinc = the income from other sources (assets, inheritance, and so on), 


freqarr = the frequency of arrests for prior infractions (to approximate the probability of arrest), 
freqconv = the frequency of conviction, and 
avgsen = the average sentence length after conviction. 


The choice of these variables is determined by the economic theory as well as data considerations. 
The term u contains unobserved factors, such as the wage for criminal activity, moral character, fam- 
ily background, and errors in measuring things like criminal activity and the probability of arrest. We 
could add family background variables to the model, such as number of siblings, parents’ education, 
and so on, but we can never eliminate u entirely. In fact, dealing with this error term or disturbance 
term is perhaps the most important component of any econometric analysis. 

The constants $\beta_0, \beta_1, \ldots, \beta_6$ are the parameters of the econometric model, and they describe the 
directions and strengths of the relationship between crime and the factors used to determine crime in 
the model. 

A complete econometric model for Example 1.2 might be 


$\mathit{wage} = \beta_0 + \beta_1 \mathit{educ} + \beta_2 \mathit{exper} + \beta_3 \mathit{training} + u$, [1.4] 


where the term u contains factors such as “innate ability,” quality of education, family background, 
and the myriad other factors that can influence a person’s wage. If we are specifically concerned 
about the effects of job training, then $\beta_3$ is the parameter of interest. 
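
To make the phrase "parameter of interest" concrete, one can write out the change in wage implied by equation (1.4) when training changes and everything else, including the unobserved factors in u, is held fixed. This is only a sketch that takes the linear form of (1.4) literally; the $\Delta$ ("change in") notation is introduced here for illustration.

```latex
\Delta \mathit{wage} = \beta_3 \, \Delta \mathit{training}
\quad \text{if } \Delta \mathit{educ} = 0,\ \Delta \mathit{exper} = 0,\ \Delta u = 0 .
```

Under this reading, $\beta_3$ is the change in the hourly wage associated with one more week of job training, other factors fixed.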

For the most part, econometric analysis begins by specifying an econometric model, without con- 
sideration of the details of the model’s creation. We generally follow this approach, largely because 
careful derivation of something like the economic model of crime is time consuming and can take us 
into some specialized and often difficult areas of economic theory. Economic reasoning will play a 
role in our examples, and we will merge any underlying economic theory into the econometric model 
specification. In the economic model of crime example, we would start with an econometric model 
such as (1.3) and use economic reasoning and common sense as guides for choosing the variables. 
Although this approach loses some of the richness of economic analysis, it is commonly and effec- 
tively applied by careful researchers. 

Once an econometric model such as (1.3) or (1.4) has been specified, various hypotheses of 
interest can be stated in terms of the unknown parameters. For example, in equation (1.3), we might 
hypothesize that wage, the wage that can be earned in legal employment, has no effect on criminal 
behavior. In the context of this particular econometric model, the hypothesis is equivalent to $\beta_1 = 0$. 
An empirical analysis, by definition, requires data. After data on the relevant variables have been 
collected, econometric methods are used to estimate the parameters in the econometric model and to 
formally test hypotheses of interest. In some cases, the econometric model is used to make predic- 
tions in either the testing of a theory or the study of a policy’s impact. 
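
As a rough illustration of what "estimating the parameters" can look like for an equation such as (1.4), the sketch below simulates data from a wage equation and recovers the parameters by ordinary least squares, one common estimation method. Everything specific here is an assumption made for illustration: the parameter values in `true_beta`, the sample size, the distributions of the variables, and the helper names `solve` and `ols`. None of it comes from a data set used in the text. Note also that u is generated independently of the explanatory variables, which is precisely the condition that is hard to guarantee with nonexperimental data.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def ols(X, y):
    """Least squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)


random.seed(0)
true_beta = [1.0, 0.5, 0.1, 0.2]  # hypothetical beta_0, beta_1, beta_2, beta_3

X, y = [], []
for _ in range(5000):
    educ = random.randint(8, 18)       # years of education
    exper = random.randint(0, 40)      # years of experience
    training = random.randint(0, 20)   # weeks of job training
    u = random.gauss(0, 2)             # unobserved factors
    row = [1.0, educ, exper, training]
    X.append(row)
    y.append(sum(b * x for b, x in zip(true_beta, row)) + u)

beta_hat = ols(X, y)
for name, b_true, b_hat in zip(["const", "educ", "exper", "training"], true_beta, beta_hat):
    print(f"{name:>8}: true {b_true:.2f}, estimated {b_hat:.3f}")
```

With a sample this large, the estimates should land close to the assumed values, and a hypothesis such as "training has no effect" corresponds to asking whether the estimated coefficient on training is distinguishable from zero.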

Because data collection is so important in empirical work, Section 1-3 will describe the kinds of 
data that we are likely to encounter. 


1-3 The Structure of Economic Data 


Economic data sets come in a variety of types. Whereas some econometric methods can be applied 
with little or no modification to many different kinds of data sets, the special features of some data 
sets must be accounted for or should be exploited. We next describe the most important data structures 
encountered in applied work. 


1-3a Cross-Sectional Data 


A cross-sectional data set consists of a sample of individuals, households, firms, cities, states, countries, 
or a variety of other units, taken at a given point in time. Sometimes, the data on all units do not cor- 
respond to precisely the same time period. For example, several families may be surveyed during 
different weeks within a year. In a pure cross-sectional analysis, we would ignore any minor timing 
differences in collecting the data. If a set of families was surveyed during different weeks of the same 
year, we would still view this as a cross-sectional data set. 

An important feature of cross-sectional data is that we can often assume that they have been 
obtained by random sampling from the underlying population. For example, if we obtain informa- 
tion on wages, education, experience, and other characteristics by randomly drawing 500 people from 
the working population, then we have a random sample from the population of all working people. 
Random sampling is the sampling scheme covered in introductory statistics courses, and it simplifies 
the analysis of cross-sectional data. A review of random sampling is contained in Math Refresher C. 

Sometimes, random sampling is not appropriate as an assumption for analyzing cross-sectional 
data. For example, suppose we are interested in studying factors that influence the accumulation of 
family wealth. We could survey a random sample of families, but some families might refuse to report 
their wealth. If, for example, wealthier families are less likely to disclose their wealth, then the result- 
ing sample on wealth is not a random sample from the population of all families. This is an illustra- 
tion of a sample selection problem, an advanced topic that we will discuss in Chapter 17. 
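
The wealth example can be mimicked with a small simulation. The sketch below draws a hypothetical random sample of families and then lets the chance of reporting fall as wealth rises. The wealth distribution, the response rule in `reports_wealth`, and the sample size are all assumptions invented for this illustration; actual nonresponse behavior is not known from the text.

```python
import random
from statistics import fmean

random.seed(1)

# Hypothetical random sample of families; wealth is right-skewed, in arbitrary units.
surveyed = [random.lognormvariate(0, 1) for _ in range(50_000)]


def reports_wealth(wealth):
    """Assumed response rule: the probability of reporting falls as wealth rises."""
    return random.random() < 1.0 / (1.0 + wealth)


# Only families that choose to report end up in the usable data.
respondents = [w for w in surveyed if reports_wealth(w)]

full_mean = fmean(surveyed)
resp_mean = fmean(respondents)
print(f"mean wealth, all surveyed families: {full_mean:.3f}")
print(f"mean wealth, respondents only:      {resp_mean:.3f}")
```

Because wealthier families drop out more often under the assumed rule, the respondents' average understates average wealth among all surveyed families, even though the initial survey was a random sample.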

Another violation of random sampling occurs when we sample from units that are large relative to 
the population, particularly geographical units. The potential problem in such cases is that the popula- 
tion is not large enough to reasonably assume the observations are independent draws. For example, 
if we want to explain new business activity across states as a function of wage rates, energy prices, 
corporate and property tax rates, services provided, quality of the workforce, and other state charac- 
teristics, it is unlikely that business activities in states near one another are independent. It turns out 
that the econometric methods that we discuss do work in such situations, but they sometimes need to 
be refined. For the most part, we will ignore the intricacies that arise in analyzing such situations and 
treat these problems in a random sampling framework, even when it is not technically correct to do so. 

Cross-sectional data are widely used in economics and other social sciences. In economics, the 
analysis of cross-sectional data is closely aligned with the applied microeconomics fields, such as 
labor economics, state and local public finance, industrial organization, urban economics, demogra- 
phy, and health economics. Data on individuals, households, firms, and cities at a given point in time 
are important for testing microeconomic hypotheses and evaluating economic policies. 

The cross-sectional data used for econometric analysis can be represented and stored in 
computers. Table 1.1 contains, in abbreviated form, a cross-sectional data set on 526 working individuals 
for the year 1976. (This is a subset of the data in the file WAGE1.) The variables include wage (in 
dollars per hour), educ (years of education), exper (years of potential labor force experience), female 
(an indicator for gender), and married (marital status). These last two variables are binary (zero-one) 
in nature and serve to indicate qualitative features of the individual (the person is female or not; the 
person is married or not). We will have much to say about binary variables in Chapter 7 and beyond. 

TABLE 1.1 A Cross-Sectional Data Set on Wages and Other Individual Characteristics 

obsno    wage   educ   exper   female   married 
    1    3.10     11       2        1         0 
    2    3.24     12      22        1         1 
    3    3.00     11       2        0         0 
    4    6.00      8      44        0         1 
    5    5.30     12       7        0         1 
  ...     ...    ...     ...      ...       ... 
  525   11.56     16       5        0         1 
  526    3.50     14       5        1         0 

The variable obsno in Table 1.1 is the observation number assigned to each person in the sample. 
Unlike the other variables, it is not a characteristic of the individual. All econometrics and statistics 
software packages assign an observation number to each data unit. Intuition should tell you that, for 
data such as that in Table 1.1, it does not matter which person is labeled as observation 1, which per- 
son is called observation 2, and so on. The fact that the ordering of the data does not matter for econo- 
metric analysis is a key feature of cross-sectional data sets obtained from random sampling. 
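
The claim that ordering does not matter can be checked directly. The sketch below uses only the seven rows displayed in Table 1.1 (not the full 526-observation file), computes two simple summaries, and recomputes them after shuffling the rows. The function names and the choice of summaries are assumptions made for illustration.

```python
import random
from math import fsum

# The seven rows displayed in Table 1.1: (obsno, wage, educ, exper, female, married).
rows = [
    (1, 3.10, 11, 2, 1, 0),
    (2, 3.24, 12, 22, 1, 1),
    (3, 3.00, 11, 2, 0, 0),
    (4, 6.00, 8, 44, 0, 1),
    (5, 5.30, 12, 7, 0, 1),
    (525, 11.56, 16, 5, 0, 1),
    (526, 3.50, 14, 5, 1, 0),
]


def mean_wage(data):
    """Average hourly wage in the data."""
    return fsum(r[1] for r in data) / len(data)


def slope_wage_on_educ(data):
    """Sample covariance of wage and educ divided by the sample variance of educ."""
    n = len(data)
    wbar = fsum(r[1] for r in data) / n
    ebar = fsum(r[2] for r in data) / n
    cov = fsum((r[1] - wbar) * (r[2] - ebar) for r in data)
    var = fsum((r[2] - ebar) ** 2 for r in data)
    return cov / var


# Reorder the observations at random; the obsno labels travel with their rows.
shuffled = rows[:]
random.Random(0).shuffle(shuffled)

print(mean_wage(rows), mean_wage(shuffled))
print(slope_wage_on_educ(rows), slope_wage_on_educ(shuffled))
```

Both summaries come out the same before and after shuffling, because each depends only on which observations are in the sample and not on where they sit in the file.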

Different variables sometimes correspond to different time periods in cross-sectional data sets. 
For example, to determine the effects of government policies on long-term economic growth, econo- 
mists have studied the relationship between growth in real per capita GDP over a certain period (say, 
1960 to 1985) and variables determined in part by government policy in 1960 (government consump- 
tion as a percentage of GDP and adult secondary education rates). Such a data set might be repre- 
sented as in Table 1.2, which constitutes part of the data set used in the study of cross-country growth 
rates by De Long and Summers (1991). 

The variable gpcrgdp represents average growth in real per capita GDP over the period 1960 
to 1985. The fact that govcons60 (government consumption as a percentage of GDP) and second60 
(percentage of adult population with a secondary education) correspond to the year 1960, while 
gpcrgdp is the average growth over the period from 1960 to 1985, does not lead to any special 
problems in treating this information as a cross-sectional data set. The observations are listed 
alphabetically by country, but nothing about this ordering affects any subsequent analysis. 

TABLE 1.2 A Data Set on Economic Growth Rates and Country Characteristics 

obsno   country     gpcrgdp   govcons60   second60 
    1   Argentina      0.89           9         32 
    2   Austria        3.32          16         50 
    3   Belgium        2.56          13         69 
    4   Bolivia        1.24          18         12 
  ...   ...             ...         ...        ... 
   61   Zimbabwe       2.30          17          6 


1-3b Time Series Data 


A time series data set consists of observations on a variable or several variables over time. Examples 
of time series data include stock prices, money supply, consumer price index, GDP, annual homicide 
rates, and automobile sales figures. Because past events can influence future events and lags in behav- 
ior are prevalent in the social sciences, time is an important dimension in a time series data set. Unlike 
the arrangement of cross-sectional data, the chronological ordering of observations in a time series 
conveys potentially important information. 

A key feature of time series data that makes them more difficult to analyze than cross-sectional 
data is that economic observations can rarely, if ever, be assumed to be independent across time. Most 
economic and other time series are related, often strongly related, to their recent histories. For example, 
knowing something about the GDP from last quarter tells us quite a bit about the likely range of the GDP 
during this quarter, because GDP tends to remain fairly stable from one quarter to the next. Although 
most econometric procedures can be used with both cross-sectional and time series data, more needs 
to be done in specifying econometric models for time series data before standard econometric methods 
can be justified. In addition, modifications and embellishments to standard econometric techniques have 
been developed to account for and exploit the dependent nature of economic time series and to address 
other issues, such as the fact that some economic variables tend to display clear trends over time. 
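
A small simulation can illustrate why chronological order carries information. The sketch below generates a persistent series in which each value equals 0.9 times the previous value plus a random shock, then compares the correlation between adjacent observations before and after the order is scrambled. The persistence value `rho`, the series length, and the function names are assumptions chosen for illustration; this is not a model of any particular economic series.

```python
import random
from math import fsum, sqrt


def corr(x, y):
    """Sample correlation between two equal-length lists."""
    n = len(x)
    xbar, ybar = fsum(x) / n, fsum(y) / n
    sxy = fsum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = fsum((a - xbar) ** 2 for a in x)
    syy = fsum((b - ybar) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def lag1_corr(series):
    """Correlation between each observation and the one stored just before it."""
    return corr(series[1:], series[:-1])


rng = random.Random(42)
rho = 0.9                      # assumed persistence, for illustration only
y = [0.0]
for _ in range(1999):
    y.append(rho * y[-1] + rng.gauss(0, 1))

scrambled = y[:]
rng.shuffle(scrambled)         # destroys the chronological ordering

ordered_corr = lag1_corr(y)
scrambled_corr = lag1_corr(scrambled)
print(f"adjacent-observation correlation, chronological order: {ordered_corr:.3f}")
print(f"adjacent-observation correlation, scrambled order:     {scrambled_corr:.3f}")
```

In chronological order the adjacent observations are strongly correlated, while in the scrambled copy the correlation is close to zero. The same numbers are present in both lists; only the ordering, and with it the dependence on recent history, has been lost.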

Another feature of time series data that can require special attention is the data frequency 
at which the data are collected. In economics, the most common frequencies are daily, weekly, 
monthly, quarterly, and annually. Stock prices are recorded at daily intervals (excluding Saturday and 
Sunday). The money supply in the U.S. economy is reported weekly. Many macroeconomic series are 
tabulated monthly, including inflation and unemployment rates. Other macro series are recorded less 
frequently, such as every three months (every quarter). GDP is an important example of a quarterly 
series. Other time series, such as infant mortality rates for states in the United States, are available 
only on an annual basis. 

Many weekly, monthly, and quarterly economic time series display a strong seasonal pattern, 
which can be an important factor in a time series analysis. For example, monthly data on housing 
starts differ across the months simply due to changing weather conditions. We will learn how to deal 
with seasonal time series in Chapter 10. 

Table 1.3 contains a time series data set obtained from an article by Castillo-Freeman and 
Freeman (1992) on minimum wage effects in Puerto Rico. The earliest year in the data set is the first 
observation, and the most recent year available is the last observation. When econometric methods are 
used to analyze time series data, the data should be stored in chronological order. 

TABLE 1.3 Minimum Wage, Unemployment, and Related Data for Puerto Rico 

obsno   year   avgmin   avgcov   prunemp    prgnp 
    1   1950     0.20     20.1      15.4    878.7 
    2   1951     0.21     20.7      16.0    925.0 
    3   1952     0.23     22.6      14.8   1015.9 
  ...    ...      ...      ...       ...      ... 
   37   1986     3.35     58.1      18.9   4281.6 
   38   1987     3.35     58.2      16.8   4496.7 

The variable avgmin refers to the average minimum wage for the year, avgcov is the average cov- 
erage rate (the percentage of workers covered by the minimum wage law), prunemp is the unemploy- 
ment rate, and prgnp is the gross national product, in millions of 1954 dollars. We will use these data 
later in a time series analysis of the effect of the minimum wage on employment. 


1-3c Pooled Cross Sections 


Some data sets have both cross-sectional and time series features. For example, suppose that two 
cross-sectional household surveys are taken in the United States, one in 1985 and one in 1990. 
In 1985, a random sample of households is surveyed for variables such as income, savings, fam- 
ily size, and so on. In 1990, a new random sample of households is taken using the same survey 
questions. To increase our sample size, we can form a pooled cross section by combining the 
two years. 

Pooling cross sections from different years is often an effective way of analyzing the effects 
of a new government policy. The idea is to collect data from the years before and after a key policy 
change. As an example, consider the following data set on housing prices taken in 1993 and 1995, 
before and after a reduction in property taxes in 1994. Suppose we have data on 250 houses for 1993 
and on 270 houses for 1995. One way to store such a data set is given in Table 1.4. 

Observations 1 through 250 correspond to the houses sold in 1993, and observations 251 through 
520 correspond to the 270 houses sold in 1995. Although the order in which we store the data turns 
out not to be crucial, keeping track of the year for each observation is usually very important. This is 
why we enter year as a separate variable. 

A pooled cross section is analyzed much like a standard cross section, except that we often need 
to account for secular differences in the variables across time. In fact, in addition to increasing the 
sample size, the point of a pooled cross-sectional analysis is often to see how a key relationship has 
changed over time. 
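
One way to keep track of the year in a pooled cross section is to carry an indicator for the later period alongside each observation. The sketch below does this for the eight rows displayed in Table 1.4 only, so the averages it prints describe that excerpt and not the full 520-house data set. The indicator name `y95` and the dictionary layout are assumptions made for illustration.

```python
from statistics import fmean

# The eight rows displayed in Table 1.4: (obsno, year, hprice, proptax, sqrft, bdrms).
rows = [
    (1, 1993, 85_500, 42, 1600, 3),
    (2, 1993, 67_300, 36, 1440, 3),
    (3, 1993, 134_000, 38, 2000, 4),
    (250, 1993, 243_600, 41, 2600, 4),
    (251, 1995, 65_000, 16, 1250, 2),
    (252, 1995, 182_400, 20, 2200, 4),
    (253, 1995, 97_500, 15, 1540, 3),
    (520, 1995, 57_200, 16, 1100, 2),
]

# Attach a year indicator: y95 = 1 for houses sold in 1995, 0 for 1993.
pooled = [
    {"obsno": r[0], "y95": int(r[1] == 1995), "hprice": r[2], "proptax": r[3]}
    for r in rows
]

# Average price and property tax separately for each period.
mean_by_period = {}
for flag in (0, 1):
    group = [obs for obs in pooled if obs["y95"] == flag]
    mean_by_period[flag] = {
        "hprice": fmean(obs["hprice"] for obs in group),
        "proptax": fmean(obs["proptax"] for obs in group),
    }

for flag, means in mean_by_period.items():
    print(f"y95 = {flag}: mean hprice = {means['hprice']:.0f}, mean proptax = {means['proptax']:.2f}")
```

A raw comparison of the two periods like this one does not isolate the effect of the tax reduction, because the houses sold in each year differ in size and other characteristics. It only shows how the year variable lets the two cross sections be told apart once they are stacked.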


TABLE 1.4 Pooled Cross Sections: Two Years of Housing Prices 

obsno   year    hprice   proptax   sqrft   bdrms   bthrms 
    1   1993    85,500        42    1600       3      2.0 
    2   1993    67,300        36    1440       3      2.5 
    3   1993   134,000        38    2000       4      2.5 
  ...    ...       ...       ...     ...     ...      ... 
  250   1993   243,600        41    2600       4      3.0 
  251   1995    65,000        16    1250       2      1.0 
  252   1995   182,400        20    2200       4      2.0 
  253   1995    97,500        15    1540       3      2.0 
  ...    ...       ...       ...     ...     ...      ... 
  520   1995    57,200        16    1100       2      1.5 
1-3d Panel or Longitudinal Data 


A panel data (or longitudinal data) set consists of a time series for each cross-sectional member 
in the data set. As an example, suppose we have wage, education, and employment history for a set 
of individuals followed over a 10-year period. Or we might collect information, such as investment 
and financial data, about the same set of firms over a five-year time period. Panel data can also be 
collected on geographical units. For example, we can collect data for the same set of counties in the 
United States on immigration flows, tax rates, wage rates, government expenditures, and so on, for 
the years 1980, 1985, and 1990. 

The key feature of panel data that distinguishes them from a pooled cross section is that the same 
cross-sectional units (individuals, firms, or counties in the preceding examples) are followed over a 
given time period. The data in Table 1.4 are not considered a panel data set because the houses sold 
are likely to be different in 1993 and 1995; if there are any duplicates, the number is likely to be so 
small as to be unimportant. In contrast, Table 1.5 contains a two-year panel data set on crime and 
related statistics for 150 cities in the United States. 

There are several interesting features in Table 1.5. First, each city has been given a number from 
1 through 150. Which city we decide to call city 1, city 2, and so on, is irrelevant. As with a pure cross 
section, the ordering in the cross section of a panel data set does not matter. We could use the city 
name in place of a number, but it is often useful to have both. 

A second point is that the two years of data for city 1 fill the first two rows or observations, 
observations 3 and 4 correspond to city 2, and so on. Because each of the 150 cities has two rows of 
data, any econometrics package will view this as 300 observations. This data set can be treated as a 
pooled cross section, where the same cities happen to show up in each year. But, as we will see in 
Chapters 13 and 14, we can also use the panel structure to analyze questions that cannot be answered 
by simply viewing this as a pooled cross section. 

In organizing the observations in Table 1.5, we place the two years of data for each city adjacent 
to one another, with the first year coming before the second in all cases. For just about every practi- 
cal purpose, this is the preferred way for ordering panel data sets. Contrast this organization with the 
way the pooled cross sections are stored in Table 1.4. In short, the reason for ordering panel data as 
in Table 1.5 is that we will need to perform data transformations for each city across the two years. 
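
To see why this ordering is convenient, consider one simple transformation: the change in each variable between the two years for each city. The sketch below uses the eight rows displayed in Table 1.5 and only the murders and police columns. Treating the change between years as the transformation of interest is an assumption made here for illustration, as are the names `changes`, `d_murders`, and `d_police`.

```python
# The eight rows displayed in Table 1.5, stored city by city with the earlier
# year first: (obsno, city, year, murders, police).
rows = [
    (1, 1, 1986, 5, 440),
    (2, 1, 1990, 8, 471),
    (3, 2, 1986, 2, 75),
    (4, 2, 1990, 1, 75),
    (297, 149, 1986, 10, 286),
    (298, 149, 1990, 6, 334),
    (299, 150, 1986, 25, 520),
    (300, 150, 1990, 32, 493),
]

changes = {}
# Because each city's two years are adjacent, pairing every first row with the
# row right after it lines up the same city in 1986 and 1990.
for first, second in zip(rows[0::2], rows[1::2]):
    if first[1] != second[1] or first[2] >= second[2]:
        raise ValueError("rows are not stored city by city in year order")
    changes[first[1]] = {
        "d_murders": second[3] - first[3],
        "d_police": second[4] - first[4],
    }

for city, delta in changes.items():
    print(f"city {city}: change in murders = {delta['d_murders']}, change in police = {delta['d_police']}")
```

If the data were instead stored year by year, as in a pooled cross section, the two rows for a city would first have to be matched by the city identifier before any such change could be computed.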

Because panel data require replication of the same units over time, panel data sets, especially 
those on individuals, households, and firms, are more difficult to obtain than pooled cross sections. 
Not surprisingly, observing the same units over time leads to several advantages over cross-sectional 
data or even pooled cross-sectional data. The benefit that we will focus on in this text is that having 
multiple observations on the same units allows us to control for certain unobserved characteristics 
of individuals, firms, and so on. As we will see, the use of more than one observation can facilitate 
causal inference in situations where inferring causality would be very difficult if only a single cross 
section were available. A second advantage of panel data is that they often allow us to study the 
importance of lags in behavior or the result of decision making. This information can be significant 
because many economic policies can be expected to have an impact only after some time has passed. 

TABLE 1.5 A Two-Year Panel Data Set on City Crime Statistics 

obsno   city   year   murders   population   unem   police 
    1      1   1986         5      350,000    8.7      440 
    2      1   1990         8      359,200    7.2      471 
    3      2   1986         2       64,300    5.4       75 
    4      2   1990         1       65,100    5.5       75 
  ...    ...    ...       ...          ...    ...      ... 
  297    149   1986        10      260,700    9.6      286 
  298    149   1990         6      245,000    9.8      334 
  299    150   1986        25      543,000    4.3      520 
  300    150   1990        32      546,200    5.2      493 

Most books at the undergraduate level do not contain a discussion of econometric methods for 
panel data. However, economists now recognize that some questions are difficult, if not impossible, 
to answer satisfactorily without panel data. As you will see, we can make considerable progress with 
simple panel data analysis, a method that is not much more difficult than dealing with a standard 
cross-sectional data set. 


1-3e A Comment on Data Structures 


Part 1 of this text is concerned with the analysis of cross-sectional data, because this poses the fewest con- 
ceptual and technical difficulties. At the same time, it illustrates most of the key themes of econometric 
analysis. We will use the methods and insights from cross-sectional analysis in the remainder of the text. 

Although the econometric analysis of time series uses many of the same tools as cross-sectional 
analysis, it is more complicated because of the trending, highly persistent nature of many economic 
time series. Examples that have been traditionally used to illustrate the manner in which econometric 
methods can be applied to time series data are now widely believed to be flawed. It makes little sense 
to use such examples initially, because this practice will only reinforce poor econometric practice. 
Therefore, we will postpone the treatment of time series econometrics until Part 2, when the impor- 
tant issues concerning trends, persistence, dynamics, and seasonality will be introduced. 

In Part 3, we will treat pooled cross sections and panel data explicitly. The analysis of indepen- 
dently pooled cross sections and simple panel data analysis are fairly straightforward extensions of 
pure cross-sectional analysis. Nevertheless, we will wait until Chapter 13 to deal with these topics. 


1-4 Causality, Ceteris Paribus, and Counterfactual Reasoning 


In most tests of economic theory, and certainly for evaluating public policy, the economist’s goal is 
to infer that one variable (such as education) has a causal effect on another variable (such as worker 
productivity). Simply finding an association between two or more variables might be suggestive, but 
unless causality can be established, it is rarely compelling. 

The notion of ceteris paribus—which means “other (relevant) factors being equal”—plays an 
important role in causal analysis. This idea has been implicit in some of our earlier discussion, par- 
ticularly Examples 1.1 and 1.2, but thus far we have not explicitly mentioned it. 

You probably remember from introductory economics that most economic questions are ceteris 
paribus by nature. For example, in analyzing consumer demand, we are interested in knowing the 
effect of changing the price of a good on its quantity demanded, while holding all other factors—such 
as income, prices of other goods, and individual tastes—fixed. If other factors are not held fixed, then 
we cannot know the causal effect of a price change on quantity demanded. 

Holding other factors fixed is critical for policy analysis as well. In the job training example 
(Example 1.2), we might be interested in the effect of another week of job training on wages, with 
all other components being equal (in particular, education and experience). If we succeed in holding 
all other relevant factors fixed and then find a link between job training and wages, we can conclude 
that job training has a causal effect on worker productivity. Although this may seem pretty simple, 
even at this early stage it should be clear that, except in very special cases, it will not be possible to 
literally hold all else equal. The key question in most empirical studies is: Have enough other factors 
been held fixed to make a case for causality? Rarely is an econometric study evaluated without raising 
this issue. 




In most serious applications, the number of factors that can affect the variable of interest—such 
as criminal activity or wages—is immense, and the isolation of any particular variable may seem like 
a hopeless effort. However, we will eventually see that, when carefully applied, econometric methods 
can simulate a ceteris paribus experiment. 

The notion of ceteris paribus also can be described through counterfactual reasoning, which 
has become an organizing theme in analyzing various interventions, such as policy changes. The idea 
is to imagine an economic unit, such as an individual or a firm, in two or more different states of the 
world. For example, consider studying the impact of a job training program on workers’ earnings. 
For each worker in the relevant population, we can imagine what his or her subsequent earnings 
would be under two states of the world: having participated in the job training program and having 
not participated. By considering these counterfactual outcomes (also called potential outcomes), 
we easily “hold other factors fixed” because the counterfactual thought experiment applies to each 
individual separately. We can then think of causality as meaning that the outcome—in this case, labor 
earnings—in the two states of the world differs for at least some individuals. The fact that we will 
eventually observe each worker in only one state of the world raises important problems of estima- 
tion, but that is a separate issue from the issue of what we mean by causality. We formally introduce 
an apparatus for discussing counterfactual outcomes in Chapter 2. 
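
The counterfactual thought experiment can be made concrete with a small simulation. The sketch below is only an illustration: the population, the earnings numbers, and the worker-specific training effects are all invented for the purpose. It generates both potential outcomes for every worker, which real data never provide, and then shows what an analyst would actually observe.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is reproducible

N_WORKERS = 10_000
workers = []  # one (earnings without training, earnings with training) pair per worker
for _ in range(N_WORKERS):
    ability = random.gauss(0, 1)                          # unobserved factor
    earn_without = 20 + 5 * ability + random.gauss(0, 2)  # state of the world: no training
    effect = random.choice([0.0, 2.0, 4.0])               # assumed worker-specific effect
    earn_with = earn_without + effect                     # state of the world: training
    workers.append((earn_without, earn_with))

# Both outcomes belong to the same worker, so every other factor is
# held fixed automatically when we take the difference.
effects = [y1 - y0 for y0, y1 in workers]
average_effect = mean(effects)
share_affected = mean(1.0 if e != 0 else 0.0 for e in effects)

# In practice, each worker is observed in only one state of the world.
trained = [random.random() < 0.5 for _ in range(N_WORKERS)]
observed = [y1 if d else y0 for (y0, y1), d in zip(workers, trained)]

print(f"average causal effect: {average_effect:.2f}")
print(f"share of workers with a nonzero effect: {share_affected:.2f}")
```

Training is causal in this artificial population because the two potential outcomes differ for some workers, even though the list `observed` contains only one outcome per worker. Recovering the average effect from `observed` alone is the estimation problem, which is separate from the definition of causality.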

At this point, we cannot yet explain how econometric methods can be used to estimate ceteris 
paribus effects, so we will consider some problems that can arise in trying to infer causality in eco- 
nomics. We do not use any equations in this discussion. Instead, in each example, we will discuss 
what other factors we would like to hold fixed, and sprinkle in some counterfactual reasoning. For 
each example, inferring causality would be relatively easy if we could conduct an appropriate experiment. 
Thus, it is useful to describe how such an experiment might be structured, and to observe that, 
in most cases, obtaining experimental data is impractical. It is also helpful to think about why the 
available data fail to have the important features of an experimental data set. 

We rely, for now, on your intuitive understanding of such terms as random, independence, and 
correlation, all of which should be familiar from an introductory probability and statistics course. 
(These concepts are reviewed in Math Refresher B.) We begin with an example that illustrates some 
of these important issues. 


Example 1.3 Effects of Fertilizer on Crop Yield 


Some early econometric studies [for example, Griliches (1957)] considered the effects of new 
fertilizers on crop yields. Suppose the crop under consideration is soybeans. Because fertilizer amount 
is only one factor affecting yields—some others include rainfall, quality of land, and presence of 
parasites—this issue must be posed as a ceteris paribus question. One way to determine the causal effect 
of fertilizer amount on soybean yield is to conduct an experiment, which might include the following 
steps. Choose several one-acre plots of land. Apply different amounts of fertilizer to each plot and sub- 
sequently measure the yields; this gives us a cross-sectional data set. Then, use statistical methods (to 
be introduced in Chapter 2) to measure the association between yields and fertilizer amounts. 

As described earlier, this may not seem like a very good experiment because we have said noth- 
ing about choosing plots of land that are identical in all respects except for the amount of fertilizer. 
In fact, choosing plots of land with this feature is not feasible: some of the factors, such as land 
quality, cannot even be fully observed. How do we know the results of this experiment can be used 
to measure the ceteris paribus effect of fertilizer? The answer depends on the specifics of how fertil- 
izer amounts are chosen. If the levels of fertilizer are assigned to plots independently of other plot 
features that affect yield—that is, other characteristics of plots are completely ignored when deciding 
on fertilizer amounts—then we are in business. We will justify this statement in Chapter 2. 
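
A simulated version of this experiment suggests why independent assignment matters. The sketch below rests on assumptions made up for illustration: a true effect of 0.5 units of yield per unit of fertilizer, a land-quality factor that the researcher never sees, and a slope formula (sample covariance divided by sample variance) that anticipates the methods of Chapter 2.

```python
import random
from statistics import mean

def ols_slope(x, y):
    """Slope from a simple regression of y on x: covariance over variance."""
    xbar, ybar = mean(x), mean(y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return sxy / sxx

random.seed(1)
TRUE_EFFECT = 0.5   # assumed yield gain per unit of fertilizer
N_PLOTS = 5_000

# Land quality affects yield but is not used when choosing fertilizer amounts.
land_quality = [random.gauss(0, 1) for _ in range(N_PLOTS)]
fertilizer = [random.uniform(0, 10) for _ in range(N_PLOTS)]  # assigned at random
soy_yield = [30 + TRUE_EFFECT * f + 4 * q + random.gauss(0, 1)
             for f, q in zip(fertilizer, land_quality)]

# Land quality is ignored entirely in this calculation.
slope = ols_slope(fertilizer, soy_yield)
print(f"estimated effect of fertilizer: {slope:.3f} (assumed true effect: {TRUE_EFFECT})")
```

Because fertilizer amounts are drawn without reference to land quality, the estimated slope lands close to the assumed true effect even though land quality is never measured.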


The next example is more representative of the difficulties that arise when inferring causality in 
applied economics. 
Example 1.4 Measuring the Return to Education 


Labor economists and policy makers have long been interested in the “return to education.” Somewhat 
informally, the question is posed as follows: If a person is chosen from the population and given 
another year of education, by how much will his or her wage increase? As with the previous exam- 
ples, this is a ceteris paribus question, which implies that all other factors are held fixed while another 
year of education is given to the person. Notice the element of counterfactual reasoning here: we can 
imagine the wage of each individual varying with different levels of education, that is, in different 
states of the world. Eventually, we obtain data on each worker in only one state of the world: the 
education level they actually wound up with, through perhaps a complicated process of intellectual 
ability, motivation for learning, parental input, and societal influences. 

We can imagine a social planner designing an experiment to get at this issue, much as the agricul- 
tural researcher can design an experiment to estimate fertilizer effects. Assume, for the moment, that 
the social planner has the ability to assign any level of education to any person. How would this plan- 
ner emulate the fertilizer experiment in Example 1.3? The planner would choose a group of people 
and randomly assign each person an amount of education; some people are given an eighth-grade 
education, some are given a high school education, some are given two years of college, and so on. 
Subsequently, the planner measures wages for this group of people (where we assume that each per- 
son then works in a job). The people here are like the plots in the fertilizer example, where education 
plays the role of fertilizer and wage rate plays the role of soybean yield. As with Example 1.3, if levels 
of education are assigned independently of other characteristics that affect productivity (such as expe- 
rience and innate ability), then an analysis that ignores these other factors will yield useful results. 
Again, it will take some effort in Chapter 2 to justify this claim; for now, we state it without support. 


Unlike the fertilizer-yield example, the experiment described in Example 1.4 is unfeasible. The 
ethical issues, not to mention the economic costs, associated with randomly determining education 
levels for a group of individuals are obvious. As a logistical matter, we could not give someone only 
an eighth-grade education if he or she already has a college degree. 

Even though experimental data cannot be obtained for measuring the return to education, we can 
certainly collect nonexperimental data on education levels and wages for a large group by sampling 
randomly from the population of working people. Such data are available from a variety of surveys 
used in labor economics, but these data sets have a feature that makes it difficult to estimate the 
ceteris paribus return to education. People choose their own levels of education; therefore, education 
levels are probably not determined independently of all other factors affecting wage. This problem is 
a feature shared by most nonexperimental data sets. 

One factor that affects wage is experience in the workforce. Because pursuing more education 
generally requires postponing entering the workforce, those with more education usually have less 
experience. Thus, in a nonexperimental data set on wages and education, education is likely to be neg- 
atively associated with a key variable that also affects wage. It is also believed that people with more 
innate ability often choose higher levels of education. Because higher ability leads to higher wages, 
we again have a correlation between education and a critical factor that affects wage. 

The omitted factors of experience and ability in the wage example have analogs in the fertilizer 
example. Experience is generally easy to measure and therefore is similar to a variable such as rain- 
fall. Ability, on the other hand, is nebulous and difficult to quantify; it is similar to land quality in the 
fertilizer example. As we will see throughout this text, accounting for other observed factors, such as 
experience, when estimating the ceteris paribus effect of another variable, such as education, is rela- 
tively straightforward. We will also find that accounting for inherently unobservable factors, such as 
ability, is much more problematic. It is fair to say that many of the advances in econometric methods 
have tried to deal with unobserved factors in econometric models. 

One final parallel can be drawn between Examples 1.3 and 1.4. Suppose that in the fertilizer 
example, the fertilizer amounts were not entirely determined at random. Instead, the assistant who 
chose the fertilizer levels thought it would be better to put more fertilizer on the higher-quality plots 
of land. (Agricultural researchers should have a rough idea about which plots of land are of bet- 
ter quality, even though they may not be able to fully quantify the differences.) This situation is 
completely analogous to the level of schooling being related to unobserved ability in Example 1.4. 
Because better land leads to higher yields, and more fertilizer was used on the better plots, any 
observed relationship between yield and fertilizer might be spurious. 
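
To see how such a spurious relationship can arise, consider a simulated version of the non-random assignment. Everything in this sketch is an assumption chosen for illustration: fertilizer is given no effect on yield at all, and the assistant applies more fertilizer to plots of higher (unobserved) quality.

```python
import random
from statistics import mean

def ols_slope(x, y):
    """Slope from a simple regression of y on x: covariance over variance."""
    xbar, ybar = mean(x), mean(y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return sxy / sxx

random.seed(2)
TRUE_EFFECT = 0.0   # assumption: fertilizer does nothing in this artificial world
N_PLOTS = 5_000

land_quality = [random.gauss(0, 1) for _ in range(N_PLOTS)]
# Non-random assignment: better plots receive more fertilizer.
fertilizer = [5 + 2 * q + random.gauss(0, 1) for q in land_quality]
soy_yield = [30 + TRUE_EFFECT * f + 4 * q + random.gauss(0, 1)
             for f, q in zip(fertilizer, land_quality)]

naive_slope = ols_slope(fertilizer, soy_yield)
print(f"slope ignoring land quality: {naive_slope:.3f} (assumed true effect: {TRUE_EFFECT})")
```

The slope is clearly positive even though fertilizer has no effect by construction. The association comes entirely from land quality, which drives both the fertilizer amounts and the yields.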

Difficulty in inferring causality can also arise when studying data at fairly high levels of aggregation, 
as the next example on city crime rates shows. 


Example 1.5 The Effect of Law Enforcement on City Crime Levels 


The issue of how best to prevent crime has been, and will probably continue to be, with us for some 
time. One especially important question in this regard is: Does the presence of more police officers on 
the street deter crime? 

The ceteris paribus question is easy to state: If a city is randomly chosen and given, say, ten 
additional police officers, by how much would its crime rates fall? Closely related to this thought 
experiment is explicitly setting up counterfactual outcomes: For a given city, what would its crime 
rate be under varying sizes of the police force? Another way to state the question is: If two cities are 
the same in all respects, except that city A has ten more police officers than city B, by how much 
would the two cities’ crime rates differ? 

It would be virtually impossible to find pairs of communities identical in all respects except for 
the size of their police force. Fortunately, econometric analysis does not require this. What we do need 
to know is whether the data we can collect on community crime levels and the size of the police force 
can be viewed as experimental. We can certainly imagine a true experiment involving a large collec- 
tion of cities where we dictate how many police officers each city will use for the upcoming year. 

Although policies can be used to affect the size of police forces, we clearly cannot tell each city 
how many police officers it can hire. If, as is likely, a city’s decision on how many police officers to hire 
is correlated with other city factors that affect crime, then the data must be viewed as nonexperimental. 
In fact, one way to view this problem is to see that a city’s choice of police force size and the amount of 
crime are simultaneously determined. We will explicitly address such problems in Chapter 16. 
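
A stylized simulation hints at how severe the simultaneity problem can be. The two equations and all numbers below are assumptions invented for this sketch, not estimates from any data set: an extra officer lowers crime by 0.5 units, while cities respond to each additional unit of crime by hiring one more officer.

```python
import random
from statistics import mean

def ols_slope(x, y):
    """Slope from a simple regression of y on x: covariance over variance."""
    xbar, ybar = mean(x), mean(y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return sxy / sxx

random.seed(3)
TRUE_POLICE_EFFECT = -0.5   # assumed effect of police on crime
CRIME_RESPONSE = 1.0        # assumed hiring response of cities to crime
N_CITIES = 5_000

police, crime = [], []
for _ in range(N_CITIES):
    u = random.gauss(0, 1.0)   # other city factors affecting crime
    v = random.gauss(0, 0.5)   # other city factors affecting hiring
    # Solve the two simultaneous equations for each city:
    #   crime  = 10 + TRUE_POLICE_EFFECT * police + u
    #   police =  2 + CRIME_RESPONSE * crime + v
    p = (2 + CRIME_RESPONSE * (10 + u) + v) / (1 - CRIME_RESPONSE * TRUE_POLICE_EFFECT)
    c = 10 + TRUE_POLICE_EFFECT * p + u
    police.append(p)
    crime.append(c)

naive_slope = ols_slope(police, crime)
print(f"slope of crime on police: {naive_slope:.3f} (assumed true effect: {TRUE_POLICE_EFFECT})")
```

In this artificial setting the simple association between crime and police is positive, even though more police reduce crime by construction. Cities with high crime for other reasons hire more officers, and that response dominates the deterrent effect in the raw data.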


The first three examples we have discussed have dealt with cross-sectional data at various levels 
of aggregation (for example, at the individual or city levels). The same hurdles arise when inferring 
causality in time series problems. 


Example 1.6 The Effect of the Minimum Wage on Unemployment 


An important, and perhaps contentious, policy issue concerns the effect of the minimum wage on 
unemployment rates for various groups of workers. Although this problem can be studied in a variety 
of data settings (cross-sectional, time series, or panel data), time series data are often used to look at 
aggregate effects. An example of a time series data set on unemployment rates and minimum wages 
was given in Table 1.3. 

Standard supply and demand analysis implies that, as the minimum wage is increased above 
the market clearing wage, we slide up the demand curve for labor and total employment decreases. 
(Labor supply exceeds labor demand.) To quantify this effect, we can study the relationship between 
employment and the minimum wage over time. In addition to some special difficulties that can arise 
in dealing with time series data, there are possible problems with inferring causality. The minimum 
wage in the United States is not determined in a vacuum. Various economic and political forces 
impinge on the final minimum wage for any given year. (The minimum wage, once determined, is 
usually in place for several years, unless it is indexed for inflation.) Thus, it is probable that the 
amount of the minimum wage is related to other factors that have an effect on employment levels. 
We can imagine the U.S. government conducting an experiment to determine the employment 
effects of the minimum wage (as opposed to worrying about the welfare of low-wage workers). The 
minimum wage could be randomly set by the government each year, and then the employment out- 
comes could be tabulated. The resulting experimental time series data could then be analyzed using 
fairly simple econometric methods. But this scenario hardly describes how minimum wages are set. 

If we can control enough other factors relating to employment, then we can still hope to estimate 
the ceteris paribus effect of the minimum wage on employment. In this sense, the problem is very 
similar to the previous cross-sectional examples. 


Even when economic theories are not most naturally described in terms of causality, they often 
have predictions that can be tested using econometric methods. The following example demonstrates 
this approach. 


Example 1.7 The Expectations Hypothesis 


The expectations hypothesis from financial economics states that, given all information available 
to investors at the time of investing, the expected return on any two investments is the same. For 
example, consider two possible investments with a three-month investment horizon, purchased at the 
same time: (1) Buy a three-month T-bill with a face value of $10,000, for a price below $10,000; in 
three months, you receive $10,000. (2) Buy a six-month T-bill (at a price below $10,000) and, in three 
months, sell it as a three-month T-bill. Each investment requires roughly the same amount of initial 
capital, but there is an important difference. For the first investment, you know exactly what the return 
is at the time of purchase because you know the initial price of the three-month T-bill, along with its 
face value. This is not true for the second investment: although you know the price of a six-month 
T-bill when you purchase it, you do not know the price you can sell it for in three months. Therefore, 
there is uncertainty in this investment for someone who has a three-month investment horizon. 

The actual returns on these two investments will usually be different. According to the expecta- 
tions hypothesis, the expected return from the second investment, given all information at the time of 
investment, should equal the return from purchasing a three-month T-bill. This theory turns out to be 
fairly easy to test, as we will see in Chapter 11. 
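
The arithmetic behind the two investments can be sketched with made-up prices. All prices and probabilities below are hypothetical and were chosen so that the expectations hypothesis holds exactly; actual T-bill prices would of course differ.

```python
FACE_VALUE = 10_000.0

def holding_return(buy_price, sell_price):
    """Three-month return from paying buy_price now and receiving sell_price later."""
    return (sell_price - buy_price) / buy_price

# Investment (1): three-month T-bill held to maturity. The return is known at purchase.
price_3m = 9_900.00                      # hypothetical purchase price
return_3m = holding_return(price_3m, FACE_VALUE)

# Investment (2): six-month T-bill sold after three months. The resale price is uncertain.
price_6m = 9_791.10                      # hypothetical purchase price
resale_scenarios = {9_850.0: 0.25, 9_890.0: 0.50, 9_930.0: 0.25}  # resale price -> probability

realized_6m = {price: holding_return(price_6m, price) for price in resale_scenarios}
expected_return_6m = sum(prob * realized_6m[price]
                         for price, prob in resale_scenarios.items())

print(f"return on investment (1):          {return_3m:.4%}")
print(f"expected return on investment (2): {expected_return_6m:.4%}")
for price, r in realized_6m.items():
    print(f"  realized return if resale price is {price:,.0f}: {r:.4%}")
```

With these numbers the expected return on the second investment matches the known return on the first, while the realized return is lower in one scenario and higher in another. The hypothesis is a statement about expected returns, not about the returns that actually occur.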


Summary 


In this introductory chapter, we have discussed the purpose and scope of econometric analysis. Econometrics 
is used in all applied economics fields to test economic theories, to inform government and private policy 
makers, and to predict economic time series. Sometimes, an econometric model is derived from a formal 
economic model, but in other cases, econometric models are based on informal economic reasoning and 
intuition. The goals of any econometric analysis are to estimate the parameters in the model and to test 
hypotheses about these parameters; the values and signs of the parameters determine the validity of an 
economic theory and the effects of certain policies. 

Cross-sectional, time series, pooled cross-sectional, and panel data are the most common types of data 
structures that are used in applied econometrics. Data sets involving a time dimension, such as time series and 
panel data, require special treatment because of the correlation across time of most economic time series. Other 
issues, such as trends and seasonality, arise in the analysis of time series data but not cross-sectional data. 

In Section 1-4, we discussed the notions of causality, ceteris paribus, and counterfactuals. In most 
cases, hypotheses in the social sciences are ceteris paribus in nature: all other relevant factors must be fixed 
when studying the relationship between two variables. As we discussed, one way to think of the ceteris 
paribus requirement is to undertake a thought experiment where the same economic unit operates in dif- 
ferent states of the world, such as different policy regimes. Because of the nonexperimental nature of most 
data collected in the social sciences, uncovering causal relationships is very challenging. 
Key Terms 


Causal Effect               Econometric Model       Panel Data 
Ceteris Paribus             Economic Model          Pooled Cross Section 
Counterfactual Outcomes     Empirical Analysis      Random Sampling 
Counterfactual Reasoning    Experimental Data       Retrospective Data 
Cross-Sectional Data Set    Nonexperimental Data    Time Series Data 
Data Frequency              Observational Data 


Problems 


1 Suppose that you are asked to conduct a study to determine whether smaller class sizes lead to 
improved student performance of fourth graders. 
(i) If you could conduct any experiment you want, what would you do? Be specific. 

(ii) More realistically, suppose you can collect observational data on several thousand fourth grad- 
ers in a given state. You can obtain the size of their fourth-grade class and a standardized test 
score taken at the end of fourth grade. Why might you expect a negative correlation between 
class size and test score? 

(iii) Would a negative correlation necessarily show that smaller class sizes cause better 
performance? Explain. 


2 A justification for job training programs is that they improve worker productivity. Suppose that you 
are asked to evaluate whether more job training makes workers more productive. However, rather than 
having data on individual workers, you have access to data on manufacturing firms in Ohio. In particular, 
for each firm, you have information on hours of job training per worker (training) and number of 
nondefective items produced per worker hour (output). 

(i) Carefully state the ceteris paribus thought experiment underlying this policy question. 

(ii) Does it seem likely that a firm’s decision to train its workers will be independent of worker 
characteristics? What are some of those measurable and unmeasurable worker characteristics? 

(iii) Name a factor other than worker characteristics that can affect worker productivity. 

(iv) If you find a positive correlation between output and training, would you have convincingly 
established that job training makes workers more productive? Explain. 


3 Suppose at your university you are asked to find the relationship between weekly hours spent studying 
(study) and weekly hours spent working (work). Does it make sense to characterize the problem as 
inferring whether study “causes” work or work “causes” study? Explain. 


4 States (and provinces) that have control over taxation sometimes reduce taxes in an attempt to spur 
economic growth. Suppose that you are hired by a state to estimate the effect of corporate tax rates on, 
say, the growth in per capita gross state product (GSP). 

(i) What kind of data would you need to collect to undertake a statistical analysis? 

(ii) Is it feasible to do a controlled experiment? What would be required? 
(iii) Is a correlation analysis between GSP growth and tax rates likely to be convincing? Explain. 


Computer Exercises 


C1 Use the data in WAGE1 for this exercise. 


(i) Find the average education level in the sample. What are the lowest and highest years of education? 
(ii) Find the average hourly wage in the sample. Does it seem high or low? 
(iii) The wage data are reported in 1976 dollars. Using the Internet or a printed source, find the 
Consumer Price Index (CPI) for the years 1976 and 2013. 
(iv) Use the CPI values from part (iii) to find the average hourly wage in 2013 dollars. Now does 
the average hourly wage seem reasonable? 
(v) How many women are in the sample? How many men? 


C2 Use the data in BWGHT to answer this question. 


(i) How many women are in the sample, and how many report smoking during pregnancy? 
(ii) What is the average number of cigarettes smoked per day? Is the average a good measure of the 
“typical” woman in this case? Explain. 
(iii) Among women who smoked during pregnancy, what is the average number of cigarettes 
smoked per day? How does this compare with your answer from part (ii), and why? 
(iv) Find the average of fatheduc in the sample. Why are only 1,192 observations used to compute 
this average? 
(v) Report the average family income and its standard deviation in dollars. 


C3 The data in MEAP01 are for the state of Michigan in the year 2001. Use these data to answer the 
following questions. 


(i) Find the largest and smallest values of math4. Does the range make sense? Explain. 
(ii) How many schools have a perfect pass rate on the math test? What percentage is this of the 
total sample? 
(iii) How many schools have math pass rates of exactly 50%? 
(iv) Compare the average pass rates for the math and reading scores. Which test is harder to pass? 
(v) Find the correlation between math4 and read4. What do you conclude? 
(vi) The variable exppp is expenditure per pupil. Find the average of exppp along with its standard 
deviation. Would you say there is wide variation in per pupil spending? 
(vii) Suppose School A spends $6,000 per student and School B spends $5,500 per student. By what 
percentage does School A’s spending exceed School B’s? Compare this to 
$100 \cdot [\log(6{,}000) - \log(5{,}500)]$, which is the approximate percentage difference based on the 
difference in the natural logs. (See Section A.4 in Math Refresher A.) 


C4 The data in JTRAIN2 come from a job training experiment conducted for low-income men during 
1976-1977; see Lalonde (1986). 


(i) Use the indicator variable train to determine the fraction of men receiving job training. 
(ii) The variable re78 is earnings from 1978, measured in thousands of 1982 dollars. Find the 
averages of re78 for the sample of men receiving job training and the sample not receiving job 
training. Is the difference economically large? 
(iii) The variable unem78 is an indicator of whether a man is unemployed or not in 1978. What 
fraction of the men who received job training are unemployed? What about for men who did 
not receive job training? Comment on the difference. 
(iv) From parts (ii) and (iii), does it appear that the job training program was effective? What would 
make our conclusions more convincing? 


C5 The data in FERTIL2 were collected on women living in the Republic of Botswana in 1988. The vari- 
able children refers to the number of living children. The variable electric is a binary indicator equal to 
one if the woman’s home has electricity, and zero if not. 


(i) Find the smallest and largest values of children in the sample. What is the average of children? 
(ii) What percentage of women have electricity in the home? 
(iii) Compute the average of children for those without electricity and do the same for those with 
electricity. Comment on what you find. 
(iv) From part (iii), can you infer that having electricity “causes” women to have fewer children? 
Explain. 
C6 Use the data in COUNTYMURDERS to answer this question. Use only the year 1996. The variable 
murders is the number of murders reported in the county. The variable execs is the number of execu- 
tions that took place of people sentenced to death in the given county. Most states in the United States 
have the death penalty, but several do not. 

(i) How many counties are there in the data set? Of these, how many have zero murders? What 
percentage of counties have zero executions? (Remember, use only the 1996 data.) 
(ii) What is the largest number of murders? What is the largest number of executions? Compute the 
average number of executions and explain why it is so small. 
(iii) Compute the correlation coefficient between murders and execs and describe what you find. 
(iv) You should have computed a positive correlation in part (iii). Do you think that more executions 
cause more murders to occur? What might explain the positive correlation? 


C7 The data set in ALCOHOL contains information on a sample of men in the United States. Two key 
variables are self-reported employment status and alcohol abuse (along with many other variables). 
The variables employ and abuse are both binary, or indicator, variables: they take on only the values 
zero and one. 

(i) What percentage of the men in the sample report abusing alcohol? What is the employment rate? 
(ii) Consider the group of men who abuse alcohol. What is the employment rate? 
(iii) What is the employment rate for the group of men who do not abuse alcohol? 
(iv) Discuss the difference in your answers to parts (ii) and (iii). Does this allow you to conclude 
that alcohol abuse causes unemployment? 


C8 The data in ECONMATH were obtained on students from a large university course in introductory 
microeconomics. For this problem, we are interested in two variables: score, which is the final course 
score, and econhs, which is a binary variable indicating whether a student took an economics course in 
high school. 

(i) How many students are in the sample? How many students report taking an economics course 
in high school? 

(ii) Find the average of score for those students who did take a high school economics class. How 
does it compare with the average of score for those who did not? 

(iii) Do the findings in part (ii) necessarily tell you anything about the causal effect of taking high 
school economics on college course performance? Explain. 

(iv) If you want to obtain a good causal estimate of the effect of taking a high school economics 
course using the difference in averages, what experiment would you run? 


Part 1 of the text covers regression analysis with cross-sectional data. It builds upon a solid 
base of college algebra and basic concepts in probability and statistics. Math Refresher A, B, 
and C contain complete reviews of these topics. 

Chapter 2 begins with the simple linear regression model, where we explain one variable in 
terms of another variable. Although simple regression is not widely used in applied econometrics, it 
is used occasionally and serves as a natural starting point because the algebra and interpretations 
are relatively straightforward. 

Chapters 3 and 4 cover the fundamentals of multiple regression analysis, where we allow more 
than one variable to affect the variable we are trying to explain. Multiple regression is still the most 
commonly used method in empirical research, and so these chapters deserve careful attention. 
Chapter 3 focuses on the algebra of the method of ordinary least squares (OLS), while also estab- 
lishing conditions under which the OLS estimator is unbiased and best linear unbiased. Chapter 4 
covers the important topic of statistical inference. 

Chapter 5 discusses the large sample, or asymptotic, properties of the OLS estimators. This 
provides justification of the inference procedures in Chapter 4 when the errors in a regression 
model are not normally distributed. Chapter 6 covers some additional topics in regression analysis, 
including advanced functional form issues, data scaling, prediction, and goodness-of-fit. Chapter 7 
explains how qualitative information can be incorporated into multiple regression models. 

Chapter 8 illustrates how to test for and correct the problem of heteroskedasticity, or noncon- 
stant variance, in the error terms. We show how the usual OLS statistics can be adjusted, and we 
also present an extension of OLS, known as weighted least squares, which explicitly accounts for 
different variances in the errors. Chapter 9 delves further into the very important problem of correla- 
tion between the error term and one or more of the explanatory variables. We demonstrate how the 
availability of a proxy variable can solve the omitted variables problem. In addition, we establish the 
bias and inconsistency in the OLS estimators in the presence of certain kinds of measurement errors 
in the variables. Various data problems are also discussed, including the problem of outliers. 


The Simple Regression Model


The simple regression model can be used to study the relationship between two variables. For
reasons we will see, the simple regression model has limitations as a general tool for empirical
analysis. Nevertheless, it is sometimes appropriate as an empirical tool. Learning how to interpret
the simple regression model is good practice for studying multiple regression, which we will do
in subsequent chapters.


2-1 Definition of the Simple Regression Model 
Much of applied econometric analysis begins with the following premise: y and x are two variables,
representing some population, and we are interested in “explaining y in terms of x,” or in “studying 
how y varies with changes in x.” We discussed some examples in Chapter 1, including: y is soybean 
crop yield and x is amount of fertilizer; y is hourly wage and x is years of education; and y is a com- 
munity crime rate and x is number of police officers. 

In writing down a model that will “explain y in terms of x,” we must confront three issues. First,
because there is never an exact relationship between two variables, how do we allow for other factors 
to affect y? Second, what is the functional relationship between y and x? And third, how can we be 
sure we are capturing a ceteris paribus relationship between y and x (if that is a desired goal)? 

We can resolve these ambiguities by writing down an equation relating y to x. A simple 
equation is 


$y = \beta_0 + \beta_1 x + u.$ [2.1]


Equation (2.1), which is assumed to hold in the population of interest, defines the simple linear 
regression model. It is also called the two-variable linear regression model or bivariate linear
regression model because it relates the two variables x and y. We now discuss the meaning of each
of the quantities in equation (2.1). [Incidentally, the term “regression” has origins that are not
especially important for most modern econometric applications, so we will not explain it here. See
Stigler (1986) for an engaging history of regression analysis.]

TABLE 2.1 Terminology for Simple Regression

y                      x
Dependent variable     Independent variable
Explained variable     Explanatory variable
Response variable      Control variable
Predicted variable     Predictor variable
Regressand             Regressor

When related by equation (2.1), the variables y and x have several different names used inter- 
changeably, as follows: y is called the dependent variable, the explained variable, the response 
variable, the predicted variable, or the regressand; x is called the independent variable, the 
explanatory variable, the control variable, the predictor variable, or the regressor. (The term 
covariate is also used for x.) The terms “dependent variable” and “independent variable” are fre- 
quently used in econometrics. But be aware that the label “independent” here does not refer to the 
statistical notion of independence between random variables (see Math Refresher B). 

The terms “explained” and “explanatory” variables are probably the most descriptive. “Response” 
and “control” are used mostly in the experimental sciences, where the variable x is under the experi- 
menter’s control. We will not use the terms “predicted variable” and “predictor,” although you some- 
times see these in applications that are purely about prediction and not causality. Our terminology for 
simple regression is summarized in Table 2.1. 

The variable u, called the error term or disturbance in the relationship, represents factors other 
than x that affect y. A simple regression analysis effectively treats all factors affecting y other than x as 
being unobserved. You can usefully think of u as standing for “unobserved.” 

Equation (2.1) also addresses the issue of the functional relationship between y and x. If the other 
factors in u are held fixed, so that the change in u is zero, $\Delta u = 0$, then x has a linear effect on y:

$\Delta y = \beta_1 \Delta x$ if $\Delta u = 0$. [2.2]

Thus, the change in y is simply $\beta_1$ multiplied by the change in x. This means that $\beta_1$ is the slope
parameter in the relationship between y and x, holding the other factors in u fixed; it is of primary
interest in applied economics. The intercept parameter $\beta_0$, sometimes called the constant term, also
has its uses, although it is rarely central to an analysis.


EXAMPLE 2.1 Soybean Yield and Fertilizer

Suppose that soybean yield is determined by the model

$\mathit{yield} = \beta_0 + \beta_1\,\mathit{fertilizer} + u,$ [2.3]

so that y = yield and x = fertilizer. The agricultural researcher is interested in the effect of fertilizer
on yield, holding other factors fixed. This effect is given by $\beta_1$. The error term u contains factors such
as land quality, rainfall, and so on. The coefficient $\beta_1$ measures the effect of fertilizer on yield, holding
other factors fixed: $\Delta\mathit{yield} = \beta_1 \Delta\mathit{fertilizer}$.
EXAMPLE 2.2 A Simple Wage Equation

A model relating a person’s wage to observed education and other unobserved factors is

$\mathit{wage} = \beta_0 + \beta_1\,\mathit{educ} + u.$ [2.4]

If wage is measured in dollars per hour and educ is years of education, then $\beta_1$ measures the change
in hourly wage given another year of education, holding all other factors fixed. Some of those factors
include labor force experience, innate ability, tenure with current employer, work ethic, and numerous
other things.


The linearity of equation (2.1) implies that a one-unit change in x has the same effect on y, 
regardless of the initial value of x. This is unrealistic for many economic applications. For example, in 
the wage-education example, we might want to allow for increasing returns: the next year of educa- 
tion has a larger effect on wages than did the previous year. We will see how to allow for such pos- 
sibilities in Section 2-4. 

The most difficult issue to address is whether model (2.1) really allows us to draw ceteris paribus 
conclusions about how x affects y. We just saw in equation (2.2) that $\beta_1$ does measure the effect of
x on y, holding all other factors (in u) fixed. Is this the end of the causality issue? Unfortunately, no. 
How can we hope to learn in general about the ceteris paribus effect of x on y, holding other factors 
fixed, when we are ignoring all those other factors? 

Section 2-5 will show that we are only able to get reliable estimators of $\beta_0$ and $\beta_1$ from a random
sample of data when we make an assumption restricting how the unobservable u is related to the
explanatory variable x. Without such a restriction, we will not be able to estimate the ceteris paribus
effect, $\beta_1$. Because u and x are random variables, we need a concept grounded in probability.

Before we state the key assumption about how x and u are related, we can always make one 
assumption about u. As long as the intercept $\beta_0$ is included in the equation, nothing is lost by assuming
that the average value of u in the population is zero. Mathematically,


E(u) = 0. [2.5] 


Assumption (2.5) says nothing about the relationship between u and x, but simply makes a state- 
ment about the distribution of the unobserved factors in the population. Using the previous exam- 
ples for illustration, we can see that assumption (2.5) is not very restrictive. In Example 2.1, we 
lose nothing by normalizing the unobserved factors affecting soybean yield, such as land quality, to 
have an average of zero in the population of all cultivated plots. The same is true of the unobserved 
factors in Example 2.2. Without loss of generality, we can assume that things such as average 
ability are zero in the population of all working people. If you are not convinced, you should work 
through Problem 2 to see that we can always redefine the intercept in equation (2.1) to make equa- 
tion (2.5) true. 
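
A short sketch of why the normalization is harmless: suppose, hypothetically, that the population mean of u were some nonzero constant, which we label $\alpha_0$ here purely for illustration (this symbol does not appear in the equations above). Adding and subtracting $\alpha_0$ in equation (2.1) gives

```latex
\begin{aligned}
y &= \beta_0 + \beta_1 x + u, \qquad E(u) = \alpha_0 \\
  &= (\beta_0 + \alpha_0) + \beta_1 x + (u - \alpha_0) .
\end{aligned}
```

The new error, $u - \alpha_0$, has mean zero by construction, the new intercept is $\beta_0 + \alpha_0$, and the slope $\beta_1$ is unchanged. Only the intercept absorbs the nonzero mean.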

We now turn to the crucial assumption regarding how u and x are related. A natural measure of 
the association between two random variables is the correlation coefficient. (See Math Refresher B 
for definition and properties.) If u and x are uncorrelated, then, as random variables, they are not 
linearly related. Assuming that u and x are uncorrelated goes a long way toward defining the sense in 
which u and x should be unrelated in equation (2.1). But it does not go far enough, because correla- 
tion measures only linear dependence between u and x. Correlation has a somewhat counterintuitive 
feature: it is possible for u to be uncorrelated with x while being correlated with functions of x, such 
as $x^2$. (See Section B-4 in Math Refresher B for further discussion.) This possibility is not acceptable
for most regression purposes, as it causes problems for interpreting the model and for deriving statis- 
tical properties. A better assumption involves the expected value of u given x. 
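
To make the point about correlation concrete, the following sketch builds a small, entirely hypothetical population in which u is uncorrelated with x yet correlated with $x^2$. The three-point distribution for x and the construction of u are assumptions chosen only so the covariances can be computed exactly.

```python
from fractions import Fraction

# Hypothetical population: x takes the values -1, 0, 1 with equal probability.
support = [Fraction(-1), Fraction(0), Fraction(1)]
prob = Fraction(1, 3)


def expect(f):
    """Population expectation of f(x) under the distribution above."""
    return sum(prob * f(x) for x in support)


mean_x2 = expect(lambda x: x ** 2)  # E(x^2)


def u(x):
    """Unobservable that depends on x only through x^2, centered so that E(u) = 0."""
    return x ** 2 - mean_x2


def cov(f, g):
    """Population covariance between f(x) and g(x)."""
    return expect(lambda x: f(x) * g(x)) - expect(f) * expect(g)


cov_x_u = cov(lambda x: x, u)        # covariance between x and u
cov_x2_u = cov(lambda x: x ** 2, u)  # covariance between x^2 and u

print("E(u) =", expect(u))
print("Cov(x, u) =", cov_x_u)
print("Cov(x^2, u) =", cov_x2_u)
print("u at x = 0:", u(Fraction(0)), " u at x = 1:", u(Fraction(1)))
```

In this construction u is a deterministic function of x, so its average clearly changes across values of x even though the correlation with x is zero. That is the situation a restriction on the conditional mean is meant to rule out.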

Because u and x are random variables, we can define the conditional distribution of u given any 
value of x. In particular, for any x, we can obtain the expected (or average) value of u for that slice of 
the population described by the value of x. The crucial assumption is that the average value of u does
not depend on the value of x. We can write this assumption as


$E(u|x) = E(u).$ [2.6]


Equation (2.6) says that the average value of the unobservables is the same across all slices of the 
population determined by the value of x and that the common average is necessarily equal to the 
average of u over the entire population. When assumption (2.6) holds, we say that u is mean inde- 
pendent of x. (Of course, mean independence is implied by full independence between u and x, an 
assumption often used in basic probability and statistics.) When we combine mean independence 
with assumption (2.5), we obtain the zero conditional mean assumption, E(u|x) = 0. It is critical 
to remember that equation (2.6) is the assumption with impact; assumption (2.5) essentially defines 
the intercept, $\beta_0$.

Let us see what equation (2.6) entails in the wage example. To simplify the discussion, assume 
that u is the same as innate ability. Then equation (2.6) requires that the average level of ability is the 
same, regardless of years of education. For example, if E(abil|8) denotes the average ability for the 
group of all people with eight years of education, and E(abil|16) denotes the average ability among 
people in the population with sixteen years of education, then equation (2.6) implies that these must 
be the same. In fact, the average ability level must be the same for all education levels. If, for exam- 
ple, we think that average ability increases with years of education, then equation (2.6) is false. (This 
would happen if, on average, people with more ability choose to become more educated.) As we cannot
observe innate ability, we have no way of knowing whether or not average ability is the same for all
education levels. But this is an issue that we must address before relying on simple regression analysis.

In the fertilizer example, if fertilizer amounts are chosen independently of other features of the plots,
then equation (2.6) will hold: the average land quality will not depend on the amount of fertilizer.
However, if more fertilizer is put on the higher-quality plots of land, then the expected value of u
changes with the level of fertilizer, and equation (2.6) fails.

The zero conditional mean assumption gives $\beta_1$ another interpretation that is often useful. Taking
the expected value of equation (2.1) conditional on x and using $E(u|x) = 0$ gives


$E(y|x) = \beta_0 + \beta_1 x.$ [2.8]


Equation (2.8) shows that the population regression function (PRF), E(y|x), is a linear function of 
x. The linearity means that a one-unit increase in x changes the expected value of y by the amount $\beta_1$.
For any given value of x, the distribution of y is centered about E(y|x), as illustrated in Figure 2.1. 
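
The step from equation (2.1) to equation (2.8) can be written out line by line. This is only a sketch of the argument already described: conditional on x, the term $\beta_0 + \beta_1 x$ is treated as a constant, and the last line uses the zero conditional mean assumption.

```latex
\begin{aligned}
E(y \mid x) &= E(\beta_0 + \beta_1 x + u \mid x) \\
            &= \beta_0 + \beta_1 x + E(u \mid x) \\
            &= \beta_0 + \beta_1 x .
\end{aligned}
```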

It is important to understand that equation (2.8) tells us how the average value of y changes 
with x; it does not say that y equals $\beta_0 + \beta_1 x$ for all units in the population. For example, suppose
that x is the high school grade point average and y is the college GPA, and we happen to know that 
E(colGPA|hsGPA) = 1.5 + 0.5 hsGPA. [Of course, in practice, we never know the population 
intercept and slope, but it is useful to pretend momentarily that we do to understand the nature of 
equation (2.8).] This GPA equation tells us the average college GPA among all students who have a 
given high school GPA. So suppose that hsGPA = 3.6. Then the average colGPA for all high school 
graduates who attend college with hsGPA = 3.6 is 1.5 + 0.5(3.6) = 3.3. We are certainly not say- 
ing that every student with hsGPA = 3.6 will have a 3.3 college GPA; this is clearly false. The PRF 
gives us a relationship between the average level of y at different levels of x. Some students with 
hsGPA = 3.6 will have a college GPA higher than 3.3, and some will have a lower college GPA. 
Whether the actual colGPA is above or below 3.3 depends on the unobserved factors in u, and those 
differ among students even within the slice of the population with hsGPA = 3.6. 
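
The distinction between the PRF and individual outcomes can be illustrated with a small simulation. The PRF coefficients 1.5 and 0.5 come from the example above; the normal distribution and the spread of 0.3 for the unobserved factors are assumptions made only for this sketch (and simulated GPAs are not capped at 4.0).

```python
import random


def prf_colgpa(hsgpa):
    """Population regression function from the example: E(colGPA | hsGPA)."""
    return 1.5 + 0.5 * hsgpa


mean_at_36 = prf_colgpa(3.6)

# Hypothetical unobserved factors u: mean zero, with an assumed spread of 0.3.
rng = random.Random(0)
students = [mean_at_36 + rng.gauss(0.0, 0.3) for _ in range(10_000)]

avg = sum(students) / len(students)
share_above = sum(g > mean_at_36 for g in students) / len(students)

print("PRF value at hsGPA = 3.6:", round(mean_at_36, 2))
print("simulated average colGPA:", round(avg, 2))
print("share of students above the PRF value:", round(share_above, 2))
```

The simulated average sits close to the PRF value, while individual students fall on both sides of it, which is exactly what equation (2.8) does and does not claim.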


GOING FURTHER 2.1

Suppose that a score on a final exam, score, depends on classes attended (attend) and
unobserved factors that affect exam performance (such as student ability). Then

$\mathit{score} = \beta_0 + \beta_1\,\mathit{attend} + u.$ [2.7]

When would you expect this model to satisfy equation (2.6)?


FIGURE 2.1 E(y|x) as a linear function of x. [Vertical axis: y; plotted line: $E(y|x) = \beta_0 + \beta_1 x$.]


Given the zero conditional mean assumption $E(u|x) = 0$, it is useful to view equation (2.1) as
breaking y into two components. The piece $\beta_0 + \beta_1 x$, which represents $E(y|x)$, is called the systematic
part of y (that is, the part of y explained by x), and u is called the unsystematic part, or the part
of y not explained by x. In Chapter 3, when we introduce more than one explanatory variable, we will
discuss how to determine how large the systematic part is relative to the unsystematic part.

In the next section, we will use assumptions (2.5) and (2.6) to motivate estimators of $\beta_0$ and $\beta_1$
given a random sample of data. The zero conditional mean assumption also plays a crucial role in the
statistical analysis in Section 2-5.


2-2 Deriving the Ordinary Least Squares Estimates 


Now that we have discussed the basic ingredients of the simple regression model, we will address the
important issue of how to estimate the parameters $\beta_0$ and $\beta_1$ in equation (2.1). To do this, we need a
sample from the population. Let $\{(x_i, y_i): i = 1, \ldots, n\}$ denote a random sample of size n from the
population. Because these data come from equation (2.1), we can write

$y_i = \beta_0 + \beta_1 x_i + u_i$ [2.9]

for each i. Here, $u_i$ is the error term for observation i because it contains all factors affecting $y_i$ other
than $x_i$.

As an example, $x_i$ might be the annual income and $y_i$ the annual savings for family i during a particular
year. If we have collected data on 15 families, then n = 15. A scatterplot of such a data set is
given in Figure 2.2, along with the (necessarily fictitious) population regression function.

We must decide how to use these data to obtain estimates of the intercept and slope in the popula- 
tion regression of savings on income. 
FIGURE 2.2 Scatterplot of savings and income for 15 families, and the population regression
$E(\mathit{savings}|\mathit{income}) = \beta_0 + \beta_1\,\mathit{income}$. [Vertical axis: savings; horizontal axis: income.]


There are several ways to motivate the following estimation procedure. We will use equa- 
tion (2.5) and an important implication of assumption (2.6): in the population, u is uncorrelated with 
x. Therefore, we see that u has zero expected value and that the covariance between x and u is zero: 

E(u) = 0 [2.10] 

and 
Cov(x,u) = E(xu) = 0, [2.11] 
where the first equality in equation (2.11) follows from (2.10). (See Section B-4 in Math Refresher B
for the definition and properties of covariance.) In terms of the observable variables x and y and the
unknown parameters $\beta_0$ and $\beta_1$, equations (2.10) and (2.11) can be written as

$E(y - \beta_0 - \beta_1 x) = 0$ [2.12]

and

$E[x(y - \beta_0 - \beta_1 x)] = 0,$ [2.13]

respectively. Equations (2.12) and (2.13) imply two restrictions on the joint probability distribution
of (x,y) in the population. Because there are two unknown parameters to estimate, we might hope that
equations (2.12) and (2.13) can be used to obtain good estimators of $\beta_0$ and $\beta_1$. In fact, they can be.
Given a sample of data, we choose estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ to solve the sample counterparts of equations
(2.12) and (2.13):

$n^{-1} \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0$ [2.14]

and

$n^{-1} \sum_{i=1}^{n} x_i (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0.$ [2.15]

This is an example of the method of moments approach to estimation. (See Section C-4 in Math Refresher C
for a discussion of different estimation approaches.) These equations can be solved for $\hat{\beta}_0$ and $\hat{\beta}_1$.
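
Because the two sample moment conditions are linear in the two unknown estimates, they can be solved directly as a 2x2 linear system. The sketch below does this with Cramer's rule on a tiny data set that is invented purely for illustration; the variable names are likewise our own.

```python
# Hypothetical data, invented purely for illustration (not from the text).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)

# Sums that appear once the two sample moment conditions are multiplied through by n.
sum_x = sum(x)
sum_y = sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_xx = sum(xi * xi for xi in x)

# First condition:  sum_y  = n * b0     + b1 * sum_x
# Second condition: sum_xy = b0 * sum_x + b1 * sum_xx
# Solve the 2x2 linear system by Cramer's rule.
det = n * sum_xx - sum_x ** 2
if det == 0:
    raise ValueError("x does not vary in the sample, so there is no unique solution")
b0 = (sum_y * sum_xx - sum_x * sum_xy) / det
b1 = (n * sum_xy - sum_x * sum_y) / det

# Both sample moments should be zero up to rounding error at the solution.
moment_1 = sum(yi - b0 - b1 * xi for xi, yi in zip(x, y)) / n
moment_2 = sum(xi * (yi - b0 - b1 * xi) for xi, yi in zip(x, y)) / n

print("intercept estimate:", b0)
print("slope estimate:", b1)
print("sample moments at the solution:", moment_1, moment_2)
```

The algebra that follows reaches the same solution in closed form, which is more convenient than solving the system numerically for each new data set.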
Using the basic properties of the summation operator from Math Refresher A, equation (2.14)
can be rewritten as

$\bar{y} = \hat{\beta}_0 + \hat{\beta}_1 \bar{x},$ [2.16]

where $\bar{y} = n^{-1} \sum_{i=1}^{n} y_i$ is the sample average of the $y_i$ and likewise for $\bar{x}$. This equation allows us
to write $\hat{\beta}_0$ in terms of $\hat{\beta}_1$, $\bar{y}$, and $\bar{x}$:

$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}.$ [2.17]

Therefore, once we have the slope estimate $\hat{\beta}_1$, it is straightforward to obtain the intercept estimate $\hat{\beta}_0$,
given $\bar{y}$ and $\bar{x}$.


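Dropping the $n^{-1}$ in (2.15) (because it does not affect the solution) and plugging (2.17) into
(2.15) yields

$\sum_{i=1}^{n} x_i [y_i - (\bar{y} - \hat{\beta}_1 \bar{x}) - \hat{\beta}_1 x_i] = 0,$

which, upon rearrangement, gives

$\sum_{i=1}^{n} x_i (y_i - \bar{y}) = \hat{\beta}_1 \sum_{i=1}^{n} x_i (x_i - \bar{x}).$

From basic properties of the summation operator [see (A-7) and (A-8) in Math Refresher A],

$\sum_{i=1}^{n} x_i (x_i - \bar{x}) = \sum_{i=1}^{n} (x_i - \bar{x})^2$ and $\sum_{i=1}^{n} x_i (y_i - \bar{y}) = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}).$

Therefore, provided that

$\sum_{i=1}^{n} (x_i - \bar{x})^2 > 0,$ [2.18]

the estimated slope is

$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}.$ [2.19]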


Equation (2.19) is simply the sample covariance between $x_i$ and $y_i$ divided by the sample variance
of $x_i$. Using simple algebra we can also write $\hat{\beta}_1$ as

$\hat{\beta}_1 = \hat{\rho}_{xy} \cdot \left( \frac{\hat{\sigma}_y}{\hat{\sigma}_x} \right),$

where $\hat{\rho}_{xy}$ is the sample correlation between $x_i$ and $y_i$ and $\hat{\sigma}_x$, $\hat{\sigma}_y$ denote the sample standard
deviations. (See Math Refresher C for definitions of correlation and standard deviation. Dividing all sums
by $n - 1$ does not affect the formulas.) An immediate implication is that if $x_i$ and $y_i$ are positively
correlated in the sample then $\hat{\beta}_1 > 0$; if $x_i$ and $y_i$ are negatively correlated then $\hat{\beta}_1 < 0$.
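
As a numerical check on these formulas, the sketch below computes the slope and intercept from (2.19) and (2.17) and then confirms that the slope equals the sample correlation times the ratio of sample standard deviations. The data and the helper name `ols_simple` are hypothetical, introduced only for this illustration.

```python
from statistics import mean, stdev


def ols_simple(x, y):
    """Return (intercept, slope) computed from equations (2.17) and (2.19)."""
    xbar, ybar = mean(x), mean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    if sxx == 0:
        # Condition (2.18) fails: every x in the sample has the same value.
        raise ValueError("x does not vary in the sample")
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    return intercept, slope


# Hypothetical data, invented purely for illustration.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1 = ols_simple(x, y)

# Same slope via the sample correlation and the sample standard deviations.
n = len(x)
sample_cov = sum((xi - mean(x)) * (yi - mean(y)) for xi, yi in zip(x, y)) / (n - 1)
rho_hat = sample_cov / (stdev(x) * stdev(y))
b1_from_corr = rho_hat * stdev(y) / stdev(x)

print("intercept:", b0, " slope:", b1)
print("slope from correlation form:", b1_from_corr)
```

The guard on `sxx` mirrors condition (2.18): if every x in the sample is identical, the denominator of (2.19) is zero and no slope can be computed.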

Not surprisingly, the formula for $\hat{\beta}_1$ in terms of the sample correlation and sample standard deviations
is the sample analog of the population relationship

$\beta_1 = \rho_{xy} \cdot \left( \frac{\sigma_y}{\sigma_x} \right),$

where all quantities are defined for the entire population. Recognition that $\beta_1$ is just a scaled version
of $\rho_{xy}$ highlights an important limitation of simple regression when we do not have experimental data:
in effect, simple regression is an analysis of correlation between two variables, and so one must be
careful in inferring causality.

Although the method for obtaining (2.17) and (2.19) is motivated by (2.6), the only assumption 
needed to compute the estimates for a particular sample is (2.18). This is hardly an assumption at all: 
(2.18) is true, provided the $x_i$ in the sample are not all equal to the same value. If (2.18) fails, then
we have either been unlucky in obtaining our sample from the population or we have not specified an
interesting problem (x does not vary in the population). For example, if y = wage and x = educ, then
(2.18) fails only if everyone in the sample has the same amount of education (for example, if everyone
is a high school graduate; see Figure 2.3). If just one person has a different amount of education, then
(2.18) holds, and the estimates can be computed.

FIGURE 2.3 A scatterplot of wage against education when $\mathit{educ}_i = 12$ for all i. [Vertical axis: wage;
horizontal axis: educ, with all points at educ = 12.]

The estimates given in (2.17) and (2.19) are called the ordinary least squares (OLS) estimates
of $\beta_0$ and $\beta_1$. To justify this name, for any $\hat{\beta}_0$ and $\hat{\beta}_1$ define a fitted value for y when $x = x_i$ as

$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i.$ [2.20]

This is the value we predict for y when $x = x_i$ for the given intercept and slope. There is a fitted value
for each observation in the sample. The residual for observation i is the difference between the actual
$y_i$ and its fitted value:

$\hat{u}_i = y_i - \hat{y}_i = y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i.$ [2.21]

Again, there are n such residuals. [These are not the same as the errors in (2.9), a point we return to in
Section 2-5.] The fitted values and residuals are indicated in Figure 2.4.

Now, suppose we choose $\hat{\beta}_0$ and $\hat{\beta}_1$ to make the sum of squared residuals,

$\sum_{i=1}^{n} \hat{u}_i^2 = \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)^2,$ [2.22]

as small as possible. The appendix to this chapter shows that the conditions necessary for $(\hat{\beta}_0, \hat{\beta}_1)$ to
minimize (2.22) are given exactly by equations (2.14) and (2.15), without $n^{-1}$. Equations (2.14) and
(2.15) are often called the first order conditions for the OLS estimates, a term that comes from
optimization using calculus (see Math Refresher A). From our previous calculations, we know that the
solutions to the OLS first order conditions are given by (2.17) and (2.19). The name “ordinary least
squares” comes from the fact that these estimates minimize the sum of squared residuals.
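
The minimization property can be checked numerically on a small example. The sketch below, using invented data and our own function name `ssr`, verifies that the first order conditions hold at the OLS estimates and that moving the intercept or slope away from them in any of several directions raises the sum of squared residuals. It is a spot check on a grid of nearby points, not a proof.

```python
def ssr(b0, b1, x, y):
    """Sum of squared residuals, as in (2.22), for a candidate intercept and slope."""
    return sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))


# Hypothetical data, invented purely for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n

# OLS estimates from (2.19) and (2.17).
b1_hat = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
          / sum((xi - xbar) ** 2 for xi in x))
b0_hat = ybar - b1_hat * xbar

# First order conditions: (2.14) and (2.15) without the 1/n factor.
residuals = [yi - b0_hat - b1_hat * xi for xi, yi in zip(x, y)]
foc_intercept = sum(residuals)
foc_slope = sum(xi * ui for xi, ui in zip(x, residuals))

# Compare the SSR at the OLS estimates with the SSR at nearby candidate values.
best = ssr(b0_hat, b1_hat, x, y)
nearby = [ssr(b0_hat + d0, b1_hat + d1, x, y)
          for d0 in (-0.1, 0.0, 0.1)
          for d1 in (-0.1, 0.0, 0.1)
          if (d0, d1) != (0.0, 0.0)]

print("first order conditions:", foc_intercept, foc_slope)
print("SSR at OLS estimates:", best)
print("smallest SSR among nearby candidates:", min(nearby))
```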
FIGURE 2.4 Fitted values and residuals. [Vertical axis: y; for each observation, $\hat{u}_i$ = residual and
$\hat{y}_i$ = fitted value.]


When we view ordinary least squares as minimizing the sum of squared residuals, it is natural 
to ask: why not minimize some other function of the residuals, such as the absolute values of the 
residuals? In fact, as we will discuss in the more advanced Section 9-6, minimizing the sum of the 
absolute values of the residuals is sometimes very useful. But it does have some drawbacks. First, we 
cannot obtain formulas for the resulting estimators; given a data set, the estimates must be obtained 
by numerical optimization routines. As a consequence, the statistical theory for estimators that mini- 
mize the sum of the absolute residuals is very complicated. Minimizing other functions of the residu- 
als, say, the sum of the residuals each raised to the fourth power, has similar drawbacks. (We would 
never choose our estimates to minimize, say, the sum of the residuals themselves, as residuals large 
in magnitude but with opposite signs would tend to cancel out.) With OLS, we will be able to derive 
unbiasedness, consistency, and other important statistical properties relatively easily. Plus, as the 
motivation in equations (2.12) and (2.13) suggests, and as we will see in Section 2-5, OLS is suited 
for estimating the parameters appearing in the conditional mean function (2.8). 

Once we have determined the OLS intercept and slope estimates, we form the OLS regression line: 


$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x,$ [2.23]


where it is understood that $\hat{\beta}_0$ and $\hat{\beta}_1$ have been obtained using equations (2.17) and (2.19). The
notation $\hat{y}$, read as “y hat,” emphasizes that the predicted values from equation (2.23) are estimates.
The intercept, $\hat{\beta}_0$, is the predicted value of y when x = 0, although in some cases it will not make
sense to set x = 0. In those situations, $\hat{\beta}_0$ is not, in itself, very interesting. When using (2.23) to compute
predicted values of y for various values of x, we must account for the intercept in the calculations.
Equation (2.23) is also called the sample regression function (SRF) because it is the estimated
version of the population regression function $E(y|x) = \beta_0 + \beta_1 x$. It is important to remember that
the PRF is something fixed, but unknown, in the population. Because the SRF is obtained for a given
sample of data, a new sample will generate a different slope and intercept in equation (2.23).
In most cases, the slope estimate, which we can write as 


$\hat{\beta}_1 = \Delta\hat{y} / \Delta x,$ [2.24]

is of primary interest. It tells us the amount by which $\hat{y}$ changes when x increases by one unit.
Equivalently,

$\Delta\hat{y} = \hat{\beta}_1 \Delta x,$ [2.25]

so that given any change in x (whether positive or negative), we can compute the predicted change in y.

We now present several examples of simple regression obtained by using real data. In other 
words, we find the intercept and slope estimates with equations (2.17) and (2.19). Because these 
examples involve many observations, the calculations were done using an econometrics software 
package. At this point, you should be careful not to read too much into these regressions; they are not 
necessarily uncovering a causal relationship. We have said nothing so far about the statistical proper- 
ties of OLS. In Section 2-5, we consider statistical properties after we explicitly impose assumptions 
on the population model equation (2.1). 


EXAMPLE 2.3 CEO Salary and Return on Equity


For the population of chief executive officers, let y be annual salary (salary) in thousands of dol- 
lars. Thus, y = 856.3 indicates an annual salary of $856,300, and y = 1,452.6 indicates a salary of 
$1,452,600. Let x be the average return on equity (roe) for the CEO’s firm for the previous three 
years. (Return on equity is defined in terms of net income as a percentage of common equity.) For 
example, if roe = 10, then average return on equity is 10%. 

To study the relationship between this measure of firm performance and CEO compensation, we 
postulate the simple model 


$\mathit{salary} = \beta_0 + \beta_1\,\mathit{roe} + u.$


The slope parameter $\beta_1$ measures the change in annual salary, in thousands of dollars, when return
on equity increases by one percentage point. Because a higher roe is good for the company, we think
$\beta_1 > 0$.

The data set CEOSAL1 contains information on 209 CEOs for the year 1990; these data were obtained
from Business Week (5/6/91). In this sample, the average annual salary is $1,281,120, with the smallest and 
largest being $223,000 and $14,822,000, respectively. The average return on equity for the years 1988, 
1989, and 1990 is 17.18%, with the smallest and largest values being 0.5% and 56.3%, respectively. 

Using the data in CEOSAL1, the OLS regression line relating salary to roe is


$\widehat{\mathit{salary}} = 963.191 + 18.501\,\mathit{roe}$ [2.26]
$n = 209,$


where the intercept and slope estimates have been rounded to three decimal places; we use “salary 
hat” to indicate that this is an estimated equation. How do we interpret the equation? First, if the return 
on equity is zero, roe = 0, then the predicted salary is the intercept, 963.191, which equals $963,191 
because salary is measured in thousands. Next, we can write the predicted change in salary as a function
of the change in roe: $\Delta\widehat{\mathit{salary}} = 18.501\,(\Delta\mathit{roe})$. This means that if the return on equity increases
by one percentage point, $\Delta\mathit{roe} = 1$, then salary is predicted to change by about 18.5, or $18,500.
Because (2.26) is a linear equation, this is the estimated change regardless of the initial salary. 

We can easily use (2.26) to compare predicted salaries at different values of roe. Suppose
roe = 30. Then $\widehat{\mathit{salary}}$ = 963.191 + 18.501(30) = 1,518.221 (in thousands), or $1,518,221, which is just
over $1.5 million. However, this does not mean that a particular CEO whose firm had roe = 30 earns $1,518,221.
Many other factors affect salary. This is just our prediction from the OLS regression line (2.26). The 
estimated line is graphed in Figure 2.5, along with the population regression function E(salary|roe). 
We will never know the PRF, so we cannot tell how close the SRF is to the PRF. Another sample of data 
will give a different regression line, which may or may not be closer to the population regression line. 
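
The calculations in this example are easy to reproduce. The sketch below hard-codes the rounded estimates reported in (2.26); the function name `predicted_salary` is our own, and no data file is read.

```python
def predicted_salary(roe):
    """Fitted value from (2.26): salary in thousands of dollars, roe in percent."""
    return 963.191 + 18.501 * roe


at_zero = predicted_salary(0)    # the intercept: predicted salary when roe = 0
at_30 = predicted_salary(30)     # predicted salary when roe = 30
change_per_point = predicted_salary(11) - predicted_salary(10)

print("predicted salary at roe = 0 (thousands):", at_zero)
print("predicted salary at roe = 30 (thousands):", round(at_30, 3))
print("predicted salary at roe = 30 (dollars):", round(at_30 * 1000))
print("predicted change for one more point of roe (thousands):", round(change_per_point, 3))
```

Because salary is recorded in thousands of dollars, the fitted value must be multiplied by 1,000 to express it in dollars.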
FIGURE 2.5 The OLS regression line $\widehat{\mathit{salary}} = 963.191 + 18.501\,\mathit{roe}$ and the (unknown)
population regression function. [Vertical axis: salary, with the intercept 963.191 marked; horizontal axis: roe;
lines shown: the OLS regression line and $E(\mathit{salary}|\mathit{roe}) = \beta_0 + \beta_1\,\mathit{roe}$.]


EXAMPLE 2.4 Wage and Education


For the population of people in the workforce in 1976, let y = wage, where wage is measured in dol- 
lars per hour. Thus, for a particular person, if wage = 6.75, the hourly wage is $6.75. Let x = educ 
denote years of schooling; for example, educ = 12 corresponds to a complete high school education. 
The average wage in the sample is $5.90; according to the Consumer Price Index, this amount
is equivalent to $24.90 in 2016 dollars.

Using the data in WAGE1 where n = 526 individuals, we obtain the following OLS regression 
line (or sample regression function): 


$\widehat{\mathit{wage}} = -0.90 + 0.54\,\mathit{educ}$ [2.27]
$n = 526.$


We must interpret this equation with caution. The intercept of -0.90 literally means that a person
with no education has a predicted hourly wage of -90¢ an hour. This, of course, is silly. It turns out
that only 18 people in the sample of 526 have less than eight years of education. Consequently, it
is not surprising that the regression line does poorly at very low levels of education. For a person
with eight years of education, the predicted wage is $\widehat{\mathit{wage}}$ = -0.90 + 0.54(8) = 3.42, or $3.42 per
hour (in 1976 dollars).

The slope estimate in (2.27) implies that one more year of education increases the predicted hourly wage
by 54¢ an hour. Therefore, four more years of education increase the predicted wage by 4(0.54) = 2.16,
or $2.16 per hour. These are fairly large effects. Because of the linear nature of (2.27), another year
of education increases the wage by the same amount, regardless of the initial level of education. In
Section 2-4, we discuss some methods that allow for nonconstant marginal effects of our explanatory
variables.


GOING FURTHER 2.2

The estimated wage from (2.27), when educ = 8, is $3.42 in 1976 dollars. What
is this value in 2016 dollars? (Hint: You have enough information in Example 2.4 to
answer this question.)
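
The wage calculations can be reproduced in the same way, which also makes the danger of extrapolation visible. The sketch hard-codes the rounded estimates from (2.27); the function name `predicted_wage` is our own.

```python
def predicted_wage(educ):
    """Fitted value from (2.27): hourly wage in 1976 dollars, educ in years."""
    return -0.90 + 0.54 * educ


wage_at_8 = predicted_wage(8)
four_more_years = predicted_wage(16) - predicted_wage(12)
wage_at_0 = predicted_wage(0)  # far outside the bulk of the sample

print("predicted wage with 8 years of education:", round(wage_at_8, 2))
print("predicted change for four more years:", round(four_more_years, 2))
print("predicted wage with no education:", round(wage_at_0, 2))
```

The negative prediction at zero years of education reflects how few observations lie at very low education levels, not a meaningful economic quantity.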


EXAMPLE 2.5 Voting Outcomes and Campaign Expenditures


The file VOTE1 contains data on election outcomes and campaign expenditures for 173 two-party 
races for the U.S. House of Representatives in 1988. There are two candidates in each race, A and 
B. Let voteA be the percentage of the vote received by Candidate A and shareA be the percentage of 
total campaign expenditures accounted for by Candidate A. Many factors other than shareA affect the 
election outcome (including the quality of the candidates and possibly the dollar amounts spent by A 
and B). Nevertheless, we can estimate a simple regression model to find out whether spending more 
relative to one’s challenger implies a higher percentage of the vote. 
The estimated equation using the 173 observations is 


$\widehat{\mathit{voteA}} = 26.81 + 0.464\,\mathit{shareA}$ [2.28]
$n = 173.$


This means that if Candidate A’s share of spending increases by one percentage point, Candidate A 
receives almost one-half a percentage point (0.464) more of the total vote. Whether or not this is a 
causal effect is unclear, but it is not unbelievable. If shareA = 50, voteA is predicted to be about 50, 
or half the vote. 


GOING FURTHER 2.3

In Example 2.5, what is the predicted vote for Candidate A if shareA = 60 (which
means 60%)? Does this answer seem reasonable?


In some cases, regression analysis is not used to determine causality but to simply look at whether two
variables are positively or negatively related, much like a standard correlation analysis. An example of
this occurs in Computer Exercise C3, where you are asked to use data from Biddle and Hamermesh
(1990) on time spent sleeping and working to investigate the tradeoff between these two factors.


2-2a A Note on Terminology 


In most cases, we will indicate the estimation of a relationship through OLS by writing an equation 
such as (2.26), (2.27), or (2.28). Sometimes, for the sake of brevity, it is useful to indicate that an OLS 
regression has been run without actually writing out the equation. We will often indicate that equation 
(2.23) has been obtained by OLS in saying that we run the regression of 


y on x, [2.29]


or simply that we regress y on x. The positions of y and x in (2.29) indicate which is the dependent 
variable and which is the independent variable: We always regress the dependent variable on the 
independent variable. For specific applications, we replace y and x with their names. Thus, to obtain 
(2.26), we regress salary on roe, or to obtain (2.28), we regress voteA on shareA. 

When we use such terminology in (2.29), we will always mean that we plan to estimate the
intercept, $\hat{\beta}_0$, along with the slope, $\hat{\beta}_1$. This case is appropriate for the vast majority of applications.
Occasionally, we may want to estimate the relationship between y and x assuming that the intercept
is zero (so that x = 0 implies that $\hat{y} = 0$); we cover this case briefly in Section 2-6. Unless explicitly
stated otherwise, we always estimate an intercept along with a slope.


2-3 Properties of OLS on Any Sample of Data 


In the previous section, we went through the algebra of deriving the formulas for the OLS intercept 
and slope estimates. In this section, we cover some further algebraic properties of the fitted OLS 
regression line. The best way to think about these properties is to remember that they hold, by con- 
struction, for any sample of data. The harder task—considering the properties of OLS across all pos- 
sible random samples of data—is postponed until Section 2-5. 

Several of the algebraic properties we are going to derive will appear mundane. Nevertheless, 
having a grasp of these properties helps us to figure out what happens to the OLS estimates and 
related statistics when the data are manipulated in certain ways, such as when the measurement units 
of the dependent and independent variables change. 


2-3a Fitted Values and Residuals 


We assume that the intercept and slope estimates, $\hat{\beta}_0$ and $\hat{\beta}_1$, have been obtained for the given sample
of data. Given $\hat{\beta}_0$ and $\hat{\beta}_1$, we can obtain the fitted value $\hat{y}_i$ for each observation. [This is given by
equation (2.20).] By definition, each fitted value $\hat{y}_i$ is on the OLS regression line. The OLS residual
associated with observation i, $\hat{u}_i$, is the difference between $y_i$ and its fitted value, as given in equation
(2.21). If $\hat{u}_i$ is positive, the line underpredicts $y_i$; if $\hat{u}_i$ is negative, the line overpredicts $y_i$. The ideal
case for observation i is when $\hat{u}_i = 0$, but in most cases, every residual is not equal to zero. In other
words, none of the data points need actually lie on the OLS line.


EXAMPLE 2.6 CEO Salary and Return on Equity


Table 2.2 contains a listing of the first 15 observations in the CEO data set, along with the fitted 
values, called salaryhat, and the residuals, called uhat. 

The first four CEOs have lower salaries than what we predicted from the OLS regression 
line (2.26); in other words, given only the firm’s roe, these CEOs make less than what we 
predicted. As can be seen from the positive uhat, the fifth CEO makes more than predicted from 
the OLS regression line. 
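As a rough check on these statements, the fitted values and residuals for the first five CEOs can be recomputed by hand. The sketch assumes the rounded coefficients 963.191 and 18.501 reported for this regression and the roe and salary values listed in Table 2.2; because the coefficients are rounded, the results agree with the table only to about two decimal places.

```python
# Rounded OLS coefficients for the regression of salary on roe (an assumption:
# these are the published, rounded values, not full-precision estimates).
B0_HAT, B1_HAT = 963.191, 18.501

# (roe, salary) for the first five CEOs listed in Table 2.2.
ceos = [(14.1, 1095), (10.9, 1001), (23.5, 1122), (5.9, 578), (13.8, 1368)]

rows = []
for roe, salary in ceos:
    salaryhat = B0_HAT + B1_HAT * roe  # fitted value on the OLS line
    uhat = salary - salaryhat          # residual: actual minus fitted
    rows.append((roe, salary, salaryhat, uhat))
    print(f"roe={roe:5.1f} salary={salary:5d} salaryhat={salaryhat:9.3f} uhat={uhat:9.3f}")
```

The first four residuals come out negative (the line overpredicts salary) and the fifth is positive, matching the discussion above.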


2-3b Algebraic Properties of OLS Statistics 


There are several useful algebraic properties of OLS estimates and their associated statistics. We now 
cover the three most important of these. 
(1) The sum, and therefore the sample average of the OLS residuals, is zero. Mathematically,


$\sum_{i=1}^{n} \hat{u}_i = 0.$ [2.30]


This property needs no proof; it follows immediately from the OLS first order condition (2.14), when
we remember that the residuals are defined by $\hat{u}_i = y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i$. In other words, the OLS estimates
$\hat{\beta}_0$ and $\hat{\beta}_1$ are chosen to make the residuals add up to zero (for any data set). This says nothing about
the residual for any particular observation i.


TABLE 2.2 Fitted Values and Residuals for the First 15 CEOs


obsno   roe    salary   salaryhat   uhat
1       14.1   1095     1224.058    -129.0581
2       10.9   1001     1164.854    -163.8542
3       23.5   1122     1397.969    -275.9692
4        5.9    578     1072.348    -494.3484
5       13.8   1368     1218.508     149.4923
6       20.0   1145     1333.215    -188.2151
7       16.4   1078     1266.611    -188.6108
8       16.3   1094     1264.761    -170.7606
9       10.5   1237     1157.454      79.54626
10      26.3    833     1449.773    -616.7726
11      25.9    567     1442.372    -875.3721
12      26.8    933     1459.023    -526.0231
13      14.8   1339     1237.009     101.9911
14      22.3    937     1375.768    -438.7678
15      56.3   2011     2004.808       6.191895

(2) The sample covariance between the regressors and the OLS residuals is zero. This follows 
from the first order condition (2.15), which can be written in terms of the residuals as 


$\sum_{i=1}^{n} x_i \hat{u}_i = 0.$ [2.31]


The sample average of the OLS residuals is zero, so the left-hand side of (2.31) is proportional to the
sample covariance between $x_i$ and $\hat{u}_i$.

(3) The point $(\bar{x}, \bar{y})$ is always on the OLS regression line. In other words, if we take equation
(2.23) and plug in $\bar{x}$ for x, then the predicted value is $\bar{y}$. This is exactly what equation (2.16) showed us.
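The three properties hold for any data set, which makes them easy to confirm numerically. The following is a minimal sketch on simulated data; the data-generating values (3.0, 0.5, the noise level) and the helper name ols_simple are arbitrary choices for illustration, and NumPy is assumed to be available.

```python
import numpy as np


def ols_simple(x, y):
    """Return (intercept, slope) from the usual simple-regression OLS formulas."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_bar, y_bar = x.mean(), y.mean()
    slope = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical data, made up purely for illustration.
rng = np.random.default_rng(0)
x = rng.uniform(0, 20, size=40)
y = 3.0 + 0.5 * x + rng.normal(0, 2, size=40)

b0, b1 = ols_simple(x, y)
y_hat = b0 + b1 * x   # fitted values
u_hat = y - y_hat     # residuals

sum_resid = u_hat.sum()            # property (1): should be zero
sum_x_resid = np.sum(x * u_hat)    # property (2): should be zero
pred_at_mean = b0 + b1 * x.mean()  # property (3): should equal the mean of y

print(sum_resid, sum_x_resid, pred_at_mean - y.mean())
```

All three printed quantities are zero up to floating-point rounding, whatever seed or sample size is used.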


EXAMPLE 2.7 Wage and Education


For the data in WAGE1, the average hourly wage in the sample is 5.90, rounded to two decimal 
places, and the average education is 12.56. If we plug educ = 12.56 into the OLS regression line 
(2.27), we get $\widehat{wage} = -0.90 + 0.54(12.56) = 5.8824$, which equals 5.9 when rounded to the first
decimal place. These figures do not exactly agree because we have rounded the average wage and 
education, as well as the intercept and slope estimates. If we did not initially round any of the values, 
we would get the answers to agree more closely, but to little useful effect. 


Writing each $y_i$ as its fitted value, plus its residual, provides another way to interpret an OLS
regression. For each i, write


$y_i = \hat{y}_i + \hat{u}_i.$ [2.32]


From property (1), the average of the residuals is zero; equivalently, the sample average of the fitted
values, $\hat{y}_i$, is the same as the sample average of the $y_i$, or $\bar{\hat{y}} = \bar{y}$. Further, properties (1) and (2) can be
used to show that the sample covariance between $\hat{y}_i$ and $\hat{u}_i$ is zero. Thus, we can view OLS as
decomposing each $y_i$ into two parts, a fitted value and a residual. The fitted values and residuals are
uncorrelated in the sample.

Define the total sum of squares (SST), the explained sum of squares (SSE), and the residual 
sum of squares (SSR) (also known as the sum of squared residuals), as follows: 


$SST = \sum_{i=1}^{n} (y_i - \bar{y})^2.$ [2.33]

$SSE = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2.$ [2.34]

$SSR = \sum_{i=1}^{n} \hat{u}_i^2.$ [2.35]


SST is a measure of the total sample variation in the $y_i$; that is, it measures how spread out the $y_i$ are in
the sample. If we divide SST by n - 1, we obtain the sample variance of y, as discussed in Math Refresher C.
Similarly, SSE measures the sample variation in the $\hat{y}_i$ (where we use the fact that $\bar{\hat{y}} = \bar{y}$), and SSR
measures the sample variation in the $\hat{u}_i$. The total variation in y can always be expressed as the sum of
the explained variation SSE and the unexplained variation SSR. Thus,


SST = SSE + SSR. [2.36] 


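Proving (2.36) is not difficult, but it requires us to use all of the properties of the summation operator
covered in Math Refresher A. Write


$\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} [(y_i - \hat{y}_i) + (\hat{y}_i - \bar{y})]^2$
$= \sum_{i=1}^{n} [\hat{u}_i + (\hat{y}_i - \bar{y})]^2$
$= \sum_{i=1}^{n} \hat{u}_i^2 + 2\sum_{i=1}^{n} \hat{u}_i(\hat{y}_i - \bar{y}) + \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$
$= SSR + 2\sum_{i=1}^{n} \hat{u}_i(\hat{y}_i - \bar{y}) + SSE.$


Now, (2.36) holds if we show that


$\sum_{i=1}^{n} \hat{u}_i(\hat{y}_i - \bar{y}) = 0.$ [2.37]


But we have already claimed that the sample covariance between the residuals and the fitted values is
zero, and this covariance is just (2.37) divided by n - 1. Thus, we have established (2.36).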
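The decomposition can also be confirmed numerically. The sketch below uses simulated data (the generating values are arbitrary and not taken from the text) and assumes NumPy; it computes the three sums of squares from their definitions in (2.33) through (2.35), along with the cross-product term in (2.37).

```python
import numpy as np

# Hypothetical data, made up purely for illustration.
rng = np.random.default_rng(1)
x = rng.normal(10, 3, size=60)
y = 1.0 + 2.0 * x + rng.normal(0, 5, size=60)

# OLS slope and intercept (with an intercept estimated, as the text assumes).
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x
u_hat = y - y_hat

sst = np.sum((y - y.mean()) ** 2)           # total sum of squares
sse = np.sum((y_hat - y.mean()) ** 2)       # explained sum of squares
ssr = np.sum(u_hat ** 2)                    # residual sum of squares
cross = np.sum(u_hat * (y_hat - y.mean()))  # the term shown to be zero in (2.37)

print(sst, sse + ssr, cross)
```

The first two printed numbers agree and the third is zero up to rounding error, which is the content of (2.36) and (2.37).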

Some words of caution about SST, SSE, and SSR are in order. There is no uniform agree- 
ment on the names or abbreviations for the three quantities defined in equations (2.33), (2.34), 
and (2.35). The total sum of squares is called either SST or TSS, so there is little confusion here. 
Unfortunately, the explained sum of squares is sometimes called the “regression sum of squares.” 
If this term is given its natural abbreviation, it can easily be confused with the term “residual sum 
of squares.” Some regression packages refer to the explained sum of squares as the “model sum of 
squares.” 

To make matters even worse, the residual sum of squares is often called the “error sum of 
squares.” This is especially unfortunate because, as we will see in Section 2-5, the errors and the 
residuals are different quantities. Thus, we will always call (2.35) the residual sum of squares or the 
sum of squared residuals. We prefer to use the abbreviation SSR to denote the sum of squared
residuals, because it is more common in econometric packages.


2-3c Goodness-of-Fit


So far, we have no way of measuring how well the explanatory or independent variable, x, explains 
the dependent variable, y. It is often useful to compute a number that summarizes how well the OLS 
regression line fits the data. In the following discussion, be sure to remember that we assume that an 
intercept is estimated along with the slope. 

Assuming that the total sum of squares, SST, is not equal to zero (which is true except in
the very unlikely event that all the $y_i$ equal the same value), we can divide (2.36) by SST to get
1 = SSE/SST + SSR/SST. The R-squared of the regression, sometimes called the coefficient of
determination, is defined as


$R^2 = SSE/SST = 1 - SSR/SST.$ [2.38]


$R^2$ is the ratio of the explained variation compared to the total variation; thus, it is interpreted as the
fraction of the sample variation in y that is explained by x. The second equality in (2.38) provides
another way for computing $R^2$.

From (2.36), the value of $R^2$ is always between zero and one, because SSE can be no greater than
SST. When interpreting $R^2$, we usually multiply it by 100 to change it into a percent: $100 \cdot R^2$ is the
percentage of the sample variation in y that is explained by x.

If the data points all lie on the same line, OLS provides a perfect fit to the data. In this case,
$R^2 = 1$. A value of $R^2$ that is nearly equal to zero indicates a poor fit of the OLS line: very little of
the variation in the $y_i$ is captured by the variation in the $\hat{y}_i$ (which all lie on the OLS regression line).
In fact, it can be shown that $R^2$ is equal to the square of the sample correlation coefficient between $y_i$
and $\hat{y}_i$. This is where the term “R-squared” came from. (The letter R was traditionally used to denote
an estimate of a population correlation coefficient, and its usage has survived in regression analysis.)
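The three characterizations of R-squared mentioned above (SSE/SST, 1 - SSR/SST, and the squared sample correlation between the actual and fitted values) can be compared on any data set. This is a minimal sketch with simulated data, assuming NumPy; the generating values are arbitrary.

```python
import numpy as np

# Hypothetical data, made up purely for illustration.
rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=80)
y = 4.0 + 1.5 * x + rng.normal(0, 6, size=80)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

sst = np.sum((y - y.mean()) ** 2)
sse = np.sum((y_hat - y.mean()) ** 2)
ssr = np.sum((y - y_hat) ** 2)

r2_explained = sse / sst                      # first form in (2.38)
r2_residual = 1 - ssr / sst                   # second form in (2.38)
r2_corr = np.corrcoef(y, y_hat)[0, 1] ** 2    # squared correlation of y and fitted values

print(r2_explained, r2_residual, r2_corr)
```

All three numbers coincide up to rounding, and each lies between zero and one.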


EXAMPLE 2.8 CEO Salary and Return on Equity


In the CEO salary regression, we obtain the following:


$\widehat{salary} = 963.191 + 18.501\, roe$ [2.39]
$n = 209, R^2 = 0.0132.$


We have reproduced the OLS regression line and the number of observations for clarity. Using the 
R-squared (rounded to four decimal places) reported for this equation, we can see how much of 
the variation in salary is actually explained by the return on equity. The answer is: not much. The 
firm’s return on equity explains only about 1.3% of the variation in salaries for this sample of 209 
CEOs. That means that 98.7% of the salary variations for these CEOs is left unexplained! This lack 
of explanatory power may not be too surprising because many other characteristics of both the firm 
and the individual CEO should influence salary; these factors are necessarily included in the errors in 
a simple regression analysis. 


In the social sciences, low R-squareds in regression equations are not uncommon, especially 
for cross-sectional analysis. We will discuss this issue more generally under multiple regres- 
sion analysis, but it is worth emphasizing now that a seemingly low R-squared does not neces- 
sarily mean that an OLS regression equation is useless. It is still possible that (2.39) is a good 
estimate of the ceteris paribus relationship between salary and roe; whether or not this is true 
does not depend directly on the size of R-squared. Students who are first learning econometrics 
tend to put too much weight on the size of the R-squared in evaluating regression equations. For 
now, be aware that using R-squared as the main gauge of success for an econometric analysis 
can lead to trouble.

Sometimes, the explanatory variable explains a substantial part of the sample variation in the
dependent variable.


EXAMPLE 2.9 Voting Outcomes and Campaign Expenditures


In the voting outcome equation in (2.28), $R^2 = 0.856$. Thus, the share of campaign expenditures
explains over 85% of the variation in the election outcomes for this sample. This is a sizable portion.


2-4 Units of Measurement and Functional Form 


Two important issues in applied economics are (1) understanding how changing the units of measure- 
ment of the dependent and/or independent variables affects OLS estimates and (2) knowing how to 
incorporate popular functional forms used in economics into regression analysis. The mathematics 
needed for a full understanding of functional form issues is reviewed in Math Refresher A. 


2-4a The Effects of Changing Units of Measurement on OLS Statistics


In Example 2.3, we chose to measure annual salary in thousands of dollars, and the return on equity 
was measured as a percentage (rather than as a decimal). It is crucial to know how salary and roe are 
measured in this example in order to make sense of the estimates in equation (2.39). 

We must also know that OLS estimates change in entirely expected ways when the units 
of measurement of the dependent and independent variables change. In Example 2.3, suppose 
that, rather than measuring salary in thousands of dollars, we measure it in dollars. Let salardol 
be salary in dollars (salardol = 845,761 would be interpreted as $845,761). Of course, salardol 
has a simple relationship to the salary measured in thousands of dollars: $salardol = 1{,}000 \cdot salary$.
We do not need to actually run the regression of salardol on roe to know that the estimated 
equation is: 


$\widehat{salardol} = 963{,}191 + 18{,}501\, roe.$ [2.40]


We obtain the intercept and slope in (2.40) simply by multiplying the intercept and the slope in (2.39) 
by 1,000. This gives equations (2.39) and (2.40) the same interpretation. Looking at (2.40), if roe = 0, 
then salardol = 963,191, so the predicted salary is $963,191 [the same value we obtained from equa- 
tion (2.39)]. Furthermore, if roe increases by one, then the predicted salary increases by $18,501; 
again, this is what we concluded from our earlier analysis of equation (2.39). 

Generally, it is easy to figure out what happens to the intercept and slope estimates when
the dependent variable changes units of measurement. If the dependent variable is multiplied
by the constant c (which means each value in the sample is multiplied by c), then the OLS
intercept and slope estimates are also multiplied by c. (This assumes nothing has changed
about the independent variable.) In the CEO salary example, c = 1,000 in moving from salary to
salardol.


GOING FURTHER 2.4
Suppose that salary is measured in hundreds of dollars, rather than in thousands of dollars, say,
salarhun. What will be the OLS intercept and slope estimates in the regression of salarhun on roe?


We can also use the CEO salary example to see what happens when we change the units of
measurement of the independent variable. Define roedec = roe/100 to be the decimal equivalent of
roe; thus, roedec = 0.23 means a return on equity of 23%. To focus on changing the units of measurement
of the independent variable, we return to our original dependent variable, salary, which is measured
in thousands of dollars. When we regress salary on roedec, we obtain


$\widehat{salary} = 963.191 + 1{,}850.1\, roedec.$ [2.41]


The coefficient on roedec is 100 times the coefficient on roe in (2.39). This is as it should 
be. Changing roe by one percentage point is equivalent to $\Delta roedec = 0.01$. From (2.41), if
$\Delta roedec = 0.01$, then $\Delta\widehat{salary} = 1{,}850.1(0.01) = 18.501$, which is what is obtained by using (2.39).
Note that, in moving from (2.39) to (2.41), the independent variable was divided by 100, and so the 
OLS slope estimate was multiplied by 100, preserving the interpretation of the equation. Generally, 
if the independent variable is divided or multiplied by some nonzero constant, c, then the OLS slope 
coefficient is multiplied or divided by c, respectively. 

The intercept has not changed in (2.41) because roedec = 0 still corresponds to a zero return on 
equity. In general, changing the units of measurement of only the independent variable does not affect 
the intercept. 

In the previous section, we defined R-squared as a goodness-of-fit measure for OLS regression.
We can also ask what happens to $R^2$ when the unit of measurement of either the independent or
the dependent variable changes. Without doing any algebra, we should know the result: the
goodness-of-fit of the model should not depend on the units of measurement of our variables. For
example, the amount of variation in salary explained by the return on equity should not depend on
whether salary is measured in dollars or in thousands of dollars or on whether return on equity is a
percentage or a decimal. This intuition can be verified mathematically: using the definition of $R^2$, it
can be shown that $R^2$ is, in fact, invariant to changes in the units of y or x.
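The rescaling rules in this subsection can be checked without the actual CEO data. The sketch below simulates a stand-in data set (the variable names mirror the text, but the numbers are invented) and fits three regressions: the original, one with the dependent variable multiplied by 1,000, and one with the independent variable divided by 100. NumPy is assumed, and ols_fit is a helper written for this illustration.

```python
import numpy as np


def ols_fit(x, y):
    """Return (intercept, slope, R-squared) for a simple regression of y on x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    intercept = y.mean() - slope * x.mean()
    resid = y - (intercept + slope * x)
    r2 = 1 - np.sum(resid ** 2) / np.sum((y - y.mean()) ** 2)
    return intercept, slope, r2


# Simulated stand-in for the CEO data (not the real data set).
rng = np.random.default_rng(3)
roe = rng.uniform(0, 50, size=100)                       # percent
salary = 900 + 20 * roe + rng.normal(0, 300, size=100)   # thousands of dollars

base = ols_fit(roe, salary)
dollars = ols_fit(roe, 1000 * salary)   # dependent variable in dollars
decimal = ols_fit(roe / 100, salary)    # independent variable as a decimal

print(base)
print(dollars)
print(decimal)
```

Multiplying y by 1,000 multiplies both estimates by 1,000; dividing x by 100 multiplies the slope by 100 and leaves the intercept alone; the R-squared is the same in all three fits.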


2-4b Incorporating Nonlinearities in Simple Regression 


So far, we have focused on linear relationships between the dependent and independent variables. 
As we mentioned in Chapter 1, linear relationships are not nearly general enough for all economic 
applications. Fortunately, it is rather easy to incorporate many nonlinearities into simple regression 
analysis by appropriately defining the dependent and independent variables. Here, we will cover two 
possibilities that often appear in applied work. 

In reading applied work in the social sciences, you will often encounter regression equations 
where the dependent variable appears in logarithmic form. Why is this done? Recall the wage- 
education example, where we regressed hourly wage on years of education. We obtained a slope esti- 
mate of 0.54 [see equation (2.27)], which means that each additional year of education is predicted to 
increase hourly wage by 54 cents. Because of the linear nature of (2.27), 54 cents is the increase for 
either the first year of education or the twentieth year; this may not be reasonable. 

Probably a better characterization of how wage changes with education is that each year of edu- 
cation increases wage by a constant percentage. For example, an increase in education from 5 years to 
6 years increases wage by, say, 8% (ceteris paribus), and an increase in education from 11 to 12 years 
also increases wage by 8%. A model that gives (approximately) a constant percentage effect is 


$\log(wage) = \beta_0 + \beta_1 educ + u,$ [2.42]


where log(·) denotes the natural logarithm. (See Math Refresher A for a review of logarithms.) In
particular, if $\Delta u = 0$, then


$\%\Delta wage \approx (100 \cdot \beta_1)\Delta educ.$ [2.43]


Notice how we multiply $\beta_1$ by 100 to get the percentage change in wage given one additional
year of education. Because the percentage change in wage is the same for each additional year of
education, the change in wage for an extra year of education increases as education increases; in
other words, (2.42) implies an increasing return to education. By exponentiating (2.42), we can write
$wage = \exp(\beta_0 + \beta_1 educ + u)$. This equation is graphed in Figure 2.6, with u = 0.


FIGURE 2.6 $wage = \exp(\beta_0 + \beta_1 educ)$, with $\beta_1 > 0$.
[Figure: wage (vertical axis) plotted against educ (horizontal axis, starting at 0).]


EXAMPLE 2.10 A Log Wage Equation


Using the same data as in Example 2.4, but using log(wage) as the dependent variable, we obtain the
following relationship:


$\widehat{\log(wage)} = 0.584 + 0.083\, educ$ [2.44]
$n = 526, R^2 = 0.186.$


The coefficient on educ has a percentage interpretation when it is multiplied by 100: wage increases 
by 8.3% for every additional year of education. This is what economists mean when they refer to the 
“return to another year of education.” 

It is important to remember that the main reason for using the log of wage in (2.42) is to impose a 
constant percentage effect of education on wage. Once equation (2.44) is obtained, the natural log of 
wage is rarely mentioned. In particular, it is not correct to say that another year of education increases 
log(wage) by 8.3%. 

The intercept in (2.44) is not very meaningful, because it gives the predicted log(wage), when 
educ = 0. The R-squared shows that educ explains about 18.6% of the variation in log(wage) (not 
wage). Finally, equation (2.44) might not capture all of the nonlinearity in the relationship between 
wage and schooling. If there are “diploma effects,” then the twelfth year of education—graduation 
from high school—could be worth much more than the eleventh year. We will learn how to allow for 
this kind of nonlinearity in Chapter 7. 
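To see what a constant percentage effect means in dollar terms, one can exponentiate the fitted line in (2.44) with u set to zero. This is only an illustrative sketch: it uses the rounded estimates 0.584 and 0.083, the helper wage_at is made up here, and exponentiating a fitted log value is a rough way to get a wage level rather than a refined prediction method.

```python
import math

B0_HAT, B1_HAT = 0.584, 0.083  # rounded estimates reported in (2.44)


def wage_at(educ: float) -> float:
    """Wage level implied by the fitted log(wage) line, setting u = 0."""
    return math.exp(B0_HAT + B1_HAT * educ)


approx_pct = 100 * B1_HAT                  # the usual approximation
exact_pct = 100 * (math.exp(B1_HAT) - 1)   # exact change implied by the exponential form

# Dollar and percentage change for one more year, starting from 5 and from 11 years.
results = {}
for start in (5, 11):
    change = wage_at(start + 1) - wage_at(start)
    pct = 100 * change / wage_at(start)
    results[start] = (change, pct)
    print(f"educ {start} -> {start + 1}: change = {change:.3f}, percent = {pct:.2f}")

print(approx_pct, round(exact_pct, 3))
```

The percentage change is identical at both starting points, while the dollar change is larger at the higher education level, which is the increasing return described in the text. The exact percentage implied by the exponential form is slightly above the 8.3% approximation.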


Estimating a model such as (2.42) is straightforward when using simple regression. Just define the 
dependent variable, y, to be y = log(wage). The independent variable is represented by x = educ. The 
mechanics of OLS are the same as before: the intercept and slope estimates are given by the formulas 
(2.17) and (2.19). In other words, we obtain $\hat{\beta}_0$ and $\hat{\beta}_1$ from the OLS regression of log(wage) on educ.

Another important use of the natural log is in obtaining a constant elasticity model.


EXAMPLE 2.11 CEO Salary and Firm Sales


We can estimate a constant elasticity model relating CEO salary to firm sales. The data set is the same 
one used in Example 2.3, except we now relate salary to sales. Let sales be annual firm sales, mea- 
sured in millions of dollars. A constant elasticity model is 


$\log(salary) = \beta_0 + \beta_1 \log(sales) + u,$ [2.45]


where $\beta_1$ is the elasticity of salary with respect to sales. This model falls under the simple regression
model by defining the dependent variable to be y = log(salary) and the independent variable to be
x = log(sales). Estimating this equation by OLS gives


$\widehat{\log(salary)} = 4.822 + 0.257 \log(sales)$ [2.46]
$n = 209, R^2 = 0.211.$


The coefficient of log(sales) is the estimated elasticity of salary with respect to sales. It implies that 
a 1% increase in firm sales increases CEO salary by about 0.257%—the usual interpretation of an 
elasticity. 
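The elasticity interpretation can be illustrated by raising sales by 1% in the fitted equation (2.46) and computing the implied percentage change in salary. The sketch uses the rounded estimates from (2.46); the sales levels are hypothetical, and salary_at is a helper written only for this example.

```python
import math

B0_HAT, B1_HAT = 4.822, 0.257  # rounded estimates reported in (2.46)


def salary_at(sales: float) -> float:
    """Salary level implied by the fitted log-log line, setting u = 0."""
    return math.exp(B0_HAT + B1_HAT * math.log(sales))


def pct_change_for_one_percent(sales: float) -> float:
    """Percentage change in implied salary when sales rise by 1%."""
    return 100 * (salary_at(1.01 * sales) / salary_at(sales) - 1)


pct_small_firm = pct_change_for_one_percent(100.0)    # hypothetical sales, $ millions
pct_large_firm = pct_change_for_one_percent(5000.0)   # hypothetical sales, $ millions

print(round(pct_small_firm, 4), round(pct_large_firm, 4))
```

Both firms show the same percentage response, close to 0.257, regardless of the starting level of sales. That is what "constant elasticity" means.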


The two functional forms covered in this section will often arise in the remainder of this text. We
have covered models containing natural logarithms here because they appear so frequently in applied
work. The interpretation of such models will not be much different in the multiple regression case.

It is also useful to note what happens to the intercept and slope estimates if we change 
the units of measurement of the dependent variable when it appears in logarithmic form. 
Because the change to logarithmic form approximates a proportionate change, it makes sense 
that nothing happens to the slope. We can see this by writing the rescaled variable as $c_1 y_i$ for each
observation i. The original equation is $\log(y_i) = \beta_0 + \beta_1 x_i + u_i$. If we add $\log(c_1)$ to both sides,
we get $\log(c_1) + \log(y_i) = [\log(c_1) + \beta_0] + \beta_1 x_i + u_i$, or $\log(c_1 y_i) = [\log(c_1) + \beta_0] + \beta_1 x_i + u_i$.
(Remember that the sum of the logs is equal to the log of their product, as shown in Math Refresher A.)
Therefore, the slope is still $\beta_1$, but the intercept is now $\log(c_1) + \beta_0$. Similarly, if the independent
variable is log(x), and we change the units of measurement of x before taking the log, the slope 
remains the same, but the intercept changes. You will be asked to verify these claims in Problem 9. 
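The first of these claims, for a rescaled dependent variable, is straightforward to confirm on simulated data. The sketch below assumes NumPy and uses invented data and an arbitrary scale factor c1; it is a numerical check, not a substitute for the algebra requested in the problem.

```python
import numpy as np


def fit(x, y):
    """Return (intercept, slope) from a simple OLS regression of y on x."""
    slope = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    return y.mean() - slope * x.mean(), slope


# Hypothetical positive-valued data, made up purely for illustration.
rng = np.random.default_rng(4)
x = rng.uniform(1, 10, size=80)
y = np.exp(0.5 + 0.1 * x + rng.normal(0, 0.2, size=80))

c1 = 1000.0  # arbitrary change of units for y

b0, b1 = fit(x, np.log(y))
b0_scaled, b1_scaled = fit(x, np.log(c1 * y))

print(b1, b1_scaled)                        # slopes agree
print(b0_scaled - b0, np.log(c1))           # intercept shifts by log(c1)
```

The slope is unchanged, and the intercept moves by exactly log(c1).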

We end this subsection by summarizing four combinations of functional forms available from 
using either the original variable or its natural log. In Table 2.3, x and y stand for the variables in their 
original form. The model with y as the dependent variable and x as the independent variable is called 
the level-level model because each variable appears in its level form. The model with log(y) as the
dependent variable and x as the independent variable is called the log-level model. We will not
explicitly discuss the level-log model here, because it arises less often in practice. In any case, we will see
examples of this model in later chapters.

The last column in Table 2.3 gives the interpretation of $\beta_1$. In the log-level model, $100 \cdot \beta_1$ is
sometimes called the semi-elasticity of y with respect to x. As we mentioned in Example 2.11, in the
log-log model, $\beta_1$ is the elasticity of y with respect to x. Table 2.3 warrants careful study, as we will
refer to it often in the remainder of the text.


TABLE 2.3 Summary of Functional Forms Involving Logarithms


Model         Dependent Variable   Independent Variable   Interpretation of $\beta_1$
Level-level   y                    x                      $\Delta y = \beta_1 \Delta x$
Level-log     y                    log(x)                 $\Delta y = (\beta_1/100)\%\Delta x$
Log-level     log(y)               x                      $\%\Delta y = (100\beta_1)\Delta x$
Log-log       log(y)               log(x)                 $\%\Delta y = \beta_1 \%\Delta x$


2-4c The Meaning of “Linear” Regression


The simple regression model that we have studied in this chapter is also called the simple 
linear regression model. Yet, as we have just seen, the general model also allows for certain 
nonlinear relationships. So what does “linear” mean here? You can see by looking at equation (2.1)
that $y = \beta_0 + \beta_1 x + u$. The key is that this equation is linear in the parameters $\beta_0$ and $\beta_1$. There are
no restrictions on how y and x relate to the original explained and explanatory variables of interest.
As we saw in Examples 2.10 and 2.11, y and x can be natural logs of variables, and this is quite common
in applications. But we need not stop there. For example, nothing prevents us from using simple
regression to estimate a model such as $cons = \beta_0 + \beta_1\sqrt{inc} + u$, where cons is annual consumption
and inc is annual income.
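Estimating the square-root model requires nothing new: define x as the square root of income and apply the same OLS formulas. The sketch below does this on simulated data; the parameter values 2,000 and 150 and the income range are invented for illustration, and NumPy is assumed.

```python
import numpy as np

# Hypothetical data: the "true" parameters below are invented for this illustration.
rng = np.random.default_rng(5)
inc = rng.uniform(10_000, 100_000, size=200)
cons = 2_000 + 150 * np.sqrt(inc) + rng.normal(0, 1_000, size=200)

# The model is nonlinear in inc but linear in the parameters:
# set x = sqrt(inc) and run the usual simple regression of cons on x.
x = np.sqrt(inc)
b1 = np.sum((x - x.mean()) * (cons - cons.mean())) / np.sum((x - x.mean()) ** 2)
b0 = cons.mean() - b1 * x.mean()

resid = cons - (b0 + b1 * x)
print(b0, b1, resid.sum())
```

The slope estimate lands near the value used to generate the data, and the residuals still sum to zero, because the mechanics of OLS do not depend on how x was constructed.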

Whereas the mechanics of simple regression do not depend on how y and x are defined, the 
interpretation of the coefficients does depend on their definitions. For successful empirical work, it 
is much more important to become proficient at interpreting coefficients than to become efficient at 
computing formulas such as (2.19). We will get much more practice with interpreting the estimates in 
OLS regression lines when we study multiple regression. 

Plenty of models cannot be cast as a linear regression model because they are not linear in their 
parameters; an example is $cons = 1/(\beta_0 + \beta_1 inc) + u$. Estimation of such models takes us into the
realm of the nonlinear regression model, which is beyond the scope of this text. For most applica- 
tions, choosing a model that can be put into the linear regression framework is sufficient. 


2-5 Expected Values and Variances of the OLS Estimators 


In Section 2-1, we defined the population model $y = \beta_0 + \beta_1 x + u$, and we claimed that the key
assumption for simple regression analysis to be useful is that the expected value of u given any value
of x is zero. In Sections 2-2, 2-3, and 2-4, we discussed the algebraic properties of OLS estimation.
We now return to the population model and study the statistical properties of OLS. In other words, we
now view $\hat{\beta}_0$ and $\hat{\beta}_1$ as estimators for the parameters $\beta_0$ and $\beta_1$ that appear in the population model.
This means that we will study properties of the distributions of $\hat{\beta}_0$ and $\hat{\beta}_1$ over different random
samples from the population. (Math Refresher C contains definitions of estimators and reviews some
of their important properties.)


2-5a Unbiasedness of OLS 


We begin by establishing the unbiasedness of OLS under a simple set of assumptions. For future ref- 
erence, it is useful to number these assumptions using the prefix “SLR” for simple linear regression. 
The first assumption defines the population model. 


Assumption SLR.1 Linear in Parameters 


In the population model, the dependent variable, y, is related to the independent variable, x, and the 
error (or disturbance), u, as 


$y = \beta_0 + \beta_1 x + u,$ [2.47]


where $\beta_0$ and $\beta_1$ are the population intercept and slope parameters, respectively.


To be realistic, y, x, and u are all viewed as random variables in stating the population model. We
discussed the interpretation of this model at some length in Section 2-1 and gave several examples. In the
previous section, we learned that equation (2.47) is not as restrictive as it initially seems; by choosing
y and x appropriately, we can obtain interesting nonlinear relationships (such as constant elasticity
models).

We are interested in using data on y and x to estimate the parameters $\beta_0$ and, especially, $\beta_1$. We
assume that our data were obtained as a random sample. (See Math Refresher C for a review of
random sampling.)


Assumption SLR.2 Random Sampling 


We have a random sample of size n, $\{(x_i, y_i): i = 1, 2, \ldots, n\}$, following the population model in
equation (2.47). 


We will have to address failure of the random sampling assumption in later chapters that deal with 
time series analysis and sample selection problems. Not all cross-sectional samples can be viewed as 
outcomes of random samples, but many can be. 

We can write (2.47) in terms of the random sample as 


$y_i = \beta_0 + \beta_1 x_i + u_i, \quad i = 1, 2, \ldots, n,$ [2.48]


where $u_i$ is the error or disturbance for observation i (for example, person i, firm i, city i, and so on).
Thus, $u_i$ contains the unobservables for observation i that affect $y_i$. The $u_i$ should not be confused with
the residuals, $\hat{u}_i$, that we defined in Section 2-3. Later on, we will explore the relationship between the
errors and the residuals. For interpreting $\beta_0$ and $\beta_1$ in a particular application, (2.47) is most
informative, but (2.48) is also needed for some of the statistical derivations.

The relationship (2.48) can be plotted for a particular outcome of data as shown in Figure 2.7. 

As we already saw in Section 2-2, the OLS slope and intercept estimates are not defined unless 
we have some sample variation in the explanatory variable. We now add variation in the $x_i$ to our list
of assumptions. 


FIGURE 2.7 Graph of $y_i = \beta_0 + \beta_1 x_i + u_i$.


Assumption SLR.3 Sample Variation in the Explanatory Variable


The sample outcomes on x, namely, $\{x_i, i = 1, \ldots, n\}$, are not all the same value.


This is a very weak assumption, certainly not worth emphasizing, but needed nevertheless. If x
varies in the population, random samples on x will typically contain variation, unless the population
variation is minimal or the sample size is small. Simple inspection of summary statistics on $x_i$ reveals
whether Assumption SLR.3 fails: if the sample standard deviation of $x_i$ is zero, then Assumption
SLR.3 fails; otherwise, it holds.

Finally, in order to obtain unbiased estimators of $\beta_0$ and $\beta_1$, we need to impose the zero
conditional mean assumption that we discussed in some detail in Section 2-1. We now explicitly add it to
our list of assumptions.


Assumption SLR.4 Zero Conditional Mean 


The error u has an expected value of zero given any value of the explanatory variable. In other words, 


$E(u|x) = 0.$


For a random sample, this assumption implies that $E(u_i|x_i) = 0$, for all $i = 1, 2, \ldots, n$.

In addition to restricting the relationship between u and x in the population, the zero conditional
mean assumption, coupled with the random sampling assumption, allows for a convenient technical
simplification. In particular, we can derive the statistical properties of the OLS estimators as conditional
on the values of the $x_i$ in our sample. Technically, in statistical derivations, conditioning on the sample
values of the independent variable is the same as treating the $x_i$ as fixed in repeated samples, which
we think of as follows. We first choose n sample values for $x_1, x_2, \ldots, x_n$. (These can be repeated.)
Given these values, we then obtain a sample on y (effectively by obtaining a random sample of the $u_i$).
Next, another sample of y is obtained, using the same values for $x_1, x_2, \ldots, x_n$. Then another sample
of y is obtained, again using the same $x_1, x_2, \ldots, x_n$. And so on.


The fixed-in-repeated-samples scenario is not very realistic in nonexperimental contexts. For
instance, in sampling individuals for the wage-education example, it makes little sense to think
of choosing the values of educ ahead of time and then sampling individuals with those particular
levels of education. Random sampling, where individuals are chosen randomly and their wage and
education are both recorded, is representative of how most data sets are obtained for empirical
analysis in the social sciences. Once we assume that $E(u|x) = 0$, and we have random sampling,
nothing is lost in derivations by treating the $x_i$ as nonrandom. The danger is that the
fixed-in-repeated-samples assumption always implies that $u_i$ and $x_i$ are independent. In deciding
when simple regression analysis is going to produce unbiased estimators, it is critical to think in
terms of Assumption SLR.4.

Now, we are ready to show that the OLS estimators are unbiased. To this end, we use the fact
that $\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} (x_i - \bar{x}) y_i$ (see Math Refresher A)
to write the OLS slope estimator in equation (2.19) as


$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x}) y_i}{\sum_{i=1}^{n} (x_i - \bar{x})^2}.$$ [2.49]


Because we are now interested in the behavior of $\hat{\beta}_1$ across all possible samples, $\hat{\beta}_1$ is
properly viewed as a random variable.

We can write $\hat{\beta}_1$ in terms of the population coefficient and errors by substituting the right-hand
side of (2.48) into (2.49). We have


$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x}) y_i}{SST_x} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(\beta_0 + \beta_1 x_i + u_i)}{SST_x},$$ [2.50]


where we have defined the total variation in $x_i$ as $SST_x = \sum_{i=1}^{n} (x_i - \bar{x})^2$ to simplify the notation.
(This is not quite the sample variance of the $x_i$ because we do not divide by $n - 1$.) Using the algebra
of the summation operator, write the numerator of $\hat{\beta}_1$ as


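$$\sum_{i=1}^{n} (x_i - \bar{x})\beta_0 + \sum_{i=1}^{n} (x_i - \bar{x})\beta_1 x_i + \sum_{i=1}^{n} (x_i - \bar{x}) u_i$$
$$= \beta_0 \sum_{i=1}^{n} (x_i - \bar{x}) + \beta_1 \sum_{i=1}^{n} (x_i - \bar{x}) x_i + \sum_{i=1}^{n} (x_i - \bar{x}) u_i.$$ [2.51]


As shown in Math Refresher A, $\sum_{i=1}^{n} (x_i - \bar{x}) = 0$ and
$\sum_{i=1}^{n} (x_i - \bar{x}) x_i = \sum_{i=1}^{n} (x_i - \bar{x})^2 = SST_x$.
Therefore, we can write the numerator of $\hat{\beta}_1$ as $\beta_1 SST_x + \sum_{i=1}^{n} (x_i - \bar{x}) u_i$.
Putting this over the denominator gives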


$$\hat{\beta}_1 = \beta_1 + \frac{\sum_{i=1}^{n} (x_i - \bar{x}) u_i}{SST_x} = \beta_1 + (1/SST_x) \sum_{i=1}^{n} d_i u_i,$$ [2.52]


where $d_i = x_i - \bar{x}$. We now see that the estimator $\hat{\beta}_1$ equals the population slope, $\beta_1$, plus a
term that is a linear combination in the errors $\{u_1, u_2, \ldots, u_n\}$. Conditional on the values of $x_i$, the
randomness in $\hat{\beta}_1$ is due entirely to the errors in the sample. The fact that these errors are generally
different from zero is what causes $\hat{\beta}_1$ to differ from $\beta_1$.
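The representation in (2.52) can be checked numerically. The sketch below simulates one sample under assumed, purely hypothetical values $\beta_0 = 1$ and $\beta_1 = 0.5$ and computes the slope two ways: from the data using (2.49), and from the simulated errors using (2.52). In practice the errors are never observed; the simulation only makes the identity visible.

```python
import random

random.seed(0)
beta0, beta1 = 1.0, 0.5          # hypothetical population parameters
n = 50
x = [random.uniform(0, 10) for _ in range(n)]
u = [random.gauss(0, 2) for _ in range(n)]   # errors, known only in a simulation
y = [beta0 + beta1 * xi + ui for xi, ui in zip(x, u)]

xbar = sum(x) / n
d = [xi - xbar for xi in x]      # d_i = x_i - xbar
sst_x = sum(di ** 2 for di in d)

b1_ols = sum(di * yi for di, yi in zip(d, y)) / sst_x             # equation (2.49)
b1_decomp = beta1 + sum(di * ui for di, ui in zip(d, u)) / sst_x  # equation (2.52)

print(b1_ols, b1_decomp)  # equal up to floating-point rounding
```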

Using the representation in (2.52), we can prove the first important statistical property of OLS. 


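UNBIASEDNESS OF OLS:
Using Assumptions SLR.1 through SLR.4,
$$E(\hat{\beta}_0) = \beta_0 \text{ and } E(\hat{\beta}_1) = \beta_1,$$ [2.53]


for any values of $\beta_0$ and $\beta_1$. In other words, $\hat{\beta}_0$ is unbiased for $\beta_0$, and $\hat{\beta}_1$ is unbiased for $\beta_1$.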


PROOF: In this proof, the expected values are conditional on the sample values of the independent
variable. Because $SST_x$ and $d_i$ are functions only of the $x_i$, they are nonrandom in the conditioning.
Therefore, from (2.52), and keeping the conditioning on $\{x_1, x_2, \ldots, x_n\}$ implicit, we have


$$E(\hat{\beta}_1) = \beta_1 + E\Big[(1/SST_x) \sum_{i=1}^{n} d_i u_i\Big] = \beta_1 + (1/SST_x) \sum_{i=1}^{n} E(d_i u_i)$$
$$= \beta_1 + (1/SST_x) \sum_{i=1}^{n} d_i E(u_i) = \beta_1 + (1/SST_x) \sum_{i=1}^{n} d_i \cdot 0 = \beta_1,$$


where we have used the fact that the expected value of each $u_i$ (conditional on $\{x_1, x_2, \ldots, x_n\}$) is
zero under Assumptions SLR.2 and SLR.4. Because unbiasedness holds for any outcome on
$\{x_1, x_2, \ldots, x_n\}$, unbiasedness also holds without conditioning on $\{x_1, x_2, \ldots, x_n\}$.

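The proof for $\hat{\beta}_0$ is now straightforward. Average (2.48) across i to get
$\bar{y} = \beta_0 + \beta_1 \bar{x} + \bar{u}$, and plug this into the formula for $\hat{\beta}_0$:


$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = \beta_0 + \beta_1 \bar{x} + \bar{u} - \hat{\beta}_1 \bar{x} = \beta_0 + (\beta_1 - \hat{\beta}_1)\bar{x} + \bar{u}.$$


Then, conditional on the values of the $x_i$,


$$E(\hat{\beta}_0) = \beta_0 + E[(\beta_1 - \hat{\beta}_1)\bar{x}] + E(\bar{u}) = \beta_0 + E[(\beta_1 - \hat{\beta}_1)]\bar{x},$$


because $E(\bar{u}) = 0$ by Assumptions SLR.2 and SLR.4. But, we showed that $E(\hat{\beta}_1) = \beta_1$, which implies
that $E[(\hat{\beta}_1 - \beta_1)] = 0$. Thus, $E(\hat{\beta}_0) = \beta_0$. Both of these arguments are valid for any values of
$\beta_0$ and $\beta_1$, and so we have established unbiasedness.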


Remember that unbiasedness is a feature of the sampling distributions of $\hat{\beta}_1$ and $\hat{\beta}_0$, which says
nothing about the estimate that we obtain for a given sample. We hope that, if the sample we obtain is somehow
“typical,” then our estimate should be “near” the population value. Unfortunately, it is always possible that
we could obtain an unlucky sample that would give us a point estimate far from $\beta_1$, and we can never
know for sure whether this is the case. You may want to review the material on unbiased estimators in Math
Refresher C, especially the simulation exercise in Table C.1 that illustrates the concept of unbiasedness.
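A small Monte Carlo exercise, in the spirit of the fixed-in-repeated-samples thought experiment, can make unbiasedness concrete. The sketch below is not the Table C.1 exercise; it assumes hypothetical values ($\beta_0 = 1$, $\beta_1 = 0.5$, normal errors with standard deviation 2), holds the $x_i$ fixed, and redraws the errors many times.

```python
import random

random.seed(1)
beta0, beta1, sigma = 1.0, 0.5, 2.0       # hypothetical population values
x = [float(i % 10) for i in range(40)]    # x values held fixed across samples
n = len(x)
xbar = sum(x) / n
sst_x = sum((xi - xbar) ** 2 for xi in x)

def draw_slope():
    """Draw a new sample of y for the same x and return the OLS slope."""
    y = [beta0 + beta1 * xi + random.gauss(0, sigma) for xi in x]
    return sum((xi - xbar) * yi for xi, yi in zip(x, y)) / sst_x

slopes = [draw_slope() for _ in range(5000)]
mean_slope = sum(slopes) / len(slopes)

print(round(mean_slope, 3))                            # close to beta1
print(round(min(slopes), 3), round(max(slopes), 3))    # single estimates can be far off
```

The average of the estimates sits near the assumed slope, while the smallest and largest estimates show how far an individual "unlucky" sample can land from it.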

Unbiasedness generally fails if any of our four assumptions fail. This means that it is important to 
think about the veracity of each assumption for a particular application. Assumption SLR.1 requires 
that y and x be linearly related, with an additive disturbance. This can certainly fail. But we also know 
that y and x can be chosen to yield interesting nonlinear relationships. Dealing with the failure of 
(2.47) requires more advanced methods that are beyond the scope of this text. 

Later, we will have to relax Assumption SLR.2, the random sampling assumption, for time series 
analysis. But what about using it for cross-sectional analysis? Random sampling can fail in a cross 
section when samples are not representative of the underlying population; in fact, some data sets are 
constructed by intentionally oversampling different parts of the population. We will discuss problems 
of nonrandom sampling in Chapters 9 and 17. 

As we have already discussed, Assumption SLR.3 almost always holds in interesting regression 
applications. Without it, we cannot even obtain the OLS estimates. 

The assumption we should concentrate on for now is SLR.4. If SLR.4 holds, the OLS estimators 
are unbiased. Likewise, if SLR.4 fails, the OLS estimators generally will be biased. There are ways to 
determine the likely direction and size of the bias, which we will study in Chapter 3. 

The possibility that x is correlated with u is almost always a concern in simple regression analy- 
sis with nonexperimental data, as we indicated with several examples in Section 2-1. Using simple 
regression when u contains factors affecting y that are also correlated with x can result in spurious 
correlation: that is, we find a relationship between y and x that is really due to other unobserved fac- 
tors that affect y and also happen to be correlated with x. 
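The following simulation sketches how such a spurious relationship can arise. All names and numbers are hypothetical: an unobserved factor z affects y and is correlated with x, while x itself has no effect on y (the true slope is zero). Under this particular design, the covariance between x and u divided by the variance of x equals one, so the simple regression slope lands near one rather than zero.

```python
import random

random.seed(3)
n = 5000
z = [random.gauss(0, 1) for _ in range(n)]        # unobserved factor
x = [zi + random.gauss(0, 1) for zi in z]         # x is correlated with z
u = [2.0 * zi + random.gauss(0, 1) for zi in z]   # the error u contains z
beta0, beta1 = 1.0, 0.0                           # x has no effect on y
y = [beta0 + beta1 * xi + ui for xi, ui in zip(x, u)]

xbar = sum(x) / n
sst_x = sum((xi - xbar) ** 2 for xi in x)
b1_hat = sum((xi - xbar) * yi for xi, yi in zip(x, y)) / sst_x

print(round(b1_hat, 2))   # well away from the true slope of zero
```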


Student Math Performance and the School Lunch Program 


Let math10 denote the percentage of tenth graders at a high school receiving a passing score on a
standardized mathematics exam. Suppose we wish to estimate the effect of the federally funded school
lunch program on student performance. If anything, we expect the lunch program to have a positive
ceteris paribus effect on performance: all other factors being equal, if a student who is too poor to eat
regular meals becomes eligible for the school lunch program, his or her performance should improve.
Let lnchprg denote the percentage of students who are eligible for the lunch program. Then, a simple
regression model is


$$math10 = \beta_0 + \beta_1 lnchprg + u,$$ [2.54]


where u contains school and student characteristics that affect overall school performance. Using the
data in MEAP93 on 408 Michigan high schools for the 1992-1993 school year, we obtain


$$\widehat{math10} = 32.14 - 0.319\, lnchprg$$
$$n = 408, \; R^2 = 0.171.$$


This equation predicts that if student eligibility in the lunch program increases by 10 percentage
points, the percentage of students passing the math exam falls by about 3.2 percentage points. Do
we really believe that higher participation in the lunch program actually causes worse performance?
Almost certainly not. A better explanation is that the error term u in equation (2.54) is correlated with
lnchprg. In fact, u contains factors such as the poverty rate of children attending school, which affects
student performance and is highly correlated with eligibility in the lunch program. Variables such as
school quality and resources are also contained in u, and these are likely correlated with lnchprg. It
is important to remember that the estimate -0.319 is only for this particular sample, but its sign and
magnitude make us suspect that u and x are correlated, so that simple regression is biased.


In addition to omitted variables, there are other reasons for x to be correlated with u in the simple 
regression model. Because the same issues arise in multiple regression analysis, we will postpone a 
systematic treatment of the problem until then. 


2-5b Variances of the OLS Estimators 


In addition to knowing that the sampling distribution of $\hat{\beta}_1$ is centered about $\beta_1$ ($\hat{\beta}_1$ is unbiased), it is
important to know how far we can expect $\hat{\beta}_1$ to be away from $\beta_1$ on average. Among other things, this
allows us to choose the best estimator among all, or at least a broad class of, unbiased estimators. The
measure of spread in the distribution of $\hat{\beta}_1$ (and $\hat{\beta}_0$) that is easiest to work with is the variance or its
square root, the standard deviation. (See Math Refresher C for a more detailed discussion.)

It turns out that the variance of the OLS estimators can be computed under Assumptions SLR.1
through SLR.4. However, these expressions would be somewhat complicated. Instead, we add an
assumption that is traditional for cross-sectional analysis. This assumption states that the variance of
the unobservable, u, conditional on x, is constant. This is known as the homoskedasticity or
“constant variance” assumption.


Assumption SLR.5 Homoskedasticity 


The error u has the same variance given any value of the explanatory variable. In other words,


$$\mathrm{Var}(u|x) = \sigma^2.$$


We must emphasize that the homoskedasticity assumption is quite distinct from the zero
conditional mean assumption, $E(u|x) = 0$. Assumption SLR.4 involves the expected value of u, while
Assumption SLR.5 concerns the variance of u (both conditional on x). Recall that we established the
unbiasedness of OLS without Assumption SLR.5: the homoskedasticity assumption plays no role in
showing that $\hat{\beta}_0$ and $\hat{\beta}_1$ are unbiased. We add Assumption SLR.5 because it simplifies the variance
calculations for $\hat{\beta}_0$ and $\hat{\beta}_1$ and because it implies that ordinary least squares has certain efficiency
properties, which we will see in Chapter 3. If we were to assume that u and x are independent, then
the distribution of u given x does not depend on x, and so $E(u|x) = E(u) = 0$ and $\mathrm{Var}(u|x) = \sigma^2$.
But independence is sometimes too strong of an assumption.

Because $\mathrm{Var}(u|x) = E(u^2|x) - [E(u|x)]^2$ and $E(u|x) = 0$, $\sigma^2 = E(u^2|x)$, which means $\sigma^2$ is
also the unconditional expectation of $u^2$. Therefore, $\sigma^2 = E(u^2) = \mathrm{Var}(u)$, because $E(u) = 0$. In
other words, $\sigma^2$ is the unconditional variance of u, and so $\sigma^2$ is often called the error variance or
disturbance variance. The square root of $\sigma^2$, $\sigma$, is the standard deviation of the error. A larger $\sigma$ means
that the distribution of the unobservables affecting y is more spread out.


FIGURE 2.8 The simple regression model under homoskedasticity. [Vertical axis: $f(y|x)$.]


It is often useful to write Assumptions SLR.4 and SLR.5 in terms of the conditional mean and 
conditional variance of y: 


$$E(y|x) = \beta_0 + \beta_1 x.$$ [2.55]


$$\mathrm{Var}(y|x) = \sigma^2.$$ [2.56]


In other words, the conditional expectation of y given x is linear in x, but the variance of y given x is
constant. This situation is graphed in Figure 2.8 where $\beta_0 > 0$ and $\beta_1 > 0$.

When $\mathrm{Var}(u|x)$ depends on x, the error term is said to exhibit heteroskedasticity (or nonconstant
variance). Because $\mathrm{Var}(u|x) = \mathrm{Var}(y|x)$, heteroskedasticity is present whenever $\mathrm{Var}(y|x)$ is a
function of x.
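To make the distinction concrete, the sketch below draws errors at a low and a high value of x under two hypothetical specifications: a constant standard deviation of 2 (homoskedasticity) and a standard deviation of 0.5x that grows with x (heteroskedasticity). The function names and the two standard deviation rules are assumptions made only for this illustration.

```python
import random
import statistics

random.seed(2)

def error_sd(x, hetero):
    # Hypothetical standard deviation rules: constant vs. increasing in x.
    return 0.5 * x if hetero else 2.0

def sample_sd(x, hetero, reps=4000):
    """Empirical standard deviation of errors drawn at a given x."""
    draws = [random.gauss(0, error_sd(x, hetero)) for _ in range(reps)]
    return statistics.pstdev(draws)

homo_low, homo_high = sample_sd(2.0, False), sample_sd(10.0, False)
het_low, het_high = sample_sd(2.0, True), sample_sd(10.0, True)

print(round(homo_low, 2), round(homo_high, 2))  # similar spread at both x values
print(round(het_low, 2), round(het_high, 2))    # spread grows with x
```

In both cases the errors have mean zero at every x, so the zero conditional mean assumption is untouched; only the spread differs.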


Heteroskedasticity in a Wage Equation 


In order to get an unbiased estimator of the ceteris paribus effect of educ on wage, we must assume that
$E(u|educ) = 0$, and this implies $E(wage|educ) = \beta_0 + \beta_1 educ$. If we also make the homoskedasticity
assumption, then $\mathrm{Var}(u|educ) = \sigma^2$ does not depend on the level of education, which is the same
as assuming $\mathrm{Var}(wage|educ) = \sigma^2$. Thus, while average wage is allowed to increase with education
level (it is this rate of increase that we are interested in estimating), the variability in wage about its
mean is assumed to be constant across all education levels. This may not be realistic. It is likely that
people with more education have a wider variety of interests and job opportunities, which could lead
to more wage variability at higher levels of education. People with very low levels of education have
fewer opportunities and often must work at the minimum wage; this serves to reduce wage variability
at low education levels. This situation is shown in Figure 2.9. Ultimately, whether Assumption SLR.5
holds is an empirical issue, and in Chapter 8 we will show how to test Assumption SLR.5.


FIGURE 2.9 $\mathrm{Var}(wage|educ)$ increasing with educ. [Axes: educ, wage, $f(wage|educ)$; line: $E(wage|educ) = \beta_0 + \beta_1 educ$.]


With the homoskedasticity assumption in place, we are ready to prove the following: 


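SAMPLING VARIANCES OF THE OLS ESTIMATORS
Under Assumptions SLR.1 through SLR.5,


$$\mathrm{Var}(\hat{\beta}_1) = \frac{\sigma^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \sigma^2/SST_x,$$ [2.57]


and


$$\mathrm{Var}(\hat{\beta}_0) = \frac{\sigma^2 n^{-1} \sum_{i=1}^{n} x_i^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2},$$ [2.58]


where these are conditional on the sample values $\{x_1, \ldots, x_n\}$.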


PROOF: We derive the formula for $\mathrm{Var}(\hat{\beta}_1)$, leaving the other derivation as Problem 10. The starting point
is equation (2.52): $\hat{\beta}_1 = \beta_1 + (1/SST_x) \sum_{i=1}^{n} d_i u_i$. Because $\beta_1$ is just a constant, and we are
conditioning on the $x_i$, $SST_x$ and $d_i = x_i - \bar{x}$ are also nonrandom. Furthermore, because the $u_i$ are independent
random variables across i (by random sampling), the variance of the sum is the sum of the variances.
Using these facts, we have


$$\mathrm{Var}(\hat{\beta}_1) = (1/SST_x)^2 \, \mathrm{Var}\Big(\sum_{i=1}^{n} d_i u_i\Big) = (1/SST_x)^2 \Big(\sum_{i=1}^{n} d_i^2 \, \mathrm{Var}(u_i)\Big)$$
$$= (1/SST_x)^2 \Big(\sum_{i=1}^{n} d_i^2 \sigma^2\Big) \quad [\text{because } \mathrm{Var}(u_i) = \sigma^2 \text{ for all } i]$$
$$= \sigma^2 (1/SST_x)^2 \Big(\sum_{i=1}^{n} d_i^2\Big) = \sigma^2 (1/SST_x)^2 SST_x = \sigma^2/SST_x,$$


which is what we wanted to show.
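The result $\mathrm{Var}(\hat{\beta}_1) = \sigma^2/SST_x$ can also be checked by simulation. The sketch below assumes hypothetical population values, holds the $x_i$ fixed (so the variance is conditional on the sample values, as in the derivation), and compares the variance of the simulated slope estimates with the formula.

```python
import random
import statistics

random.seed(4)
beta0, beta1, sigma = 1.0, 0.5, 2.0      # hypothetical population values
x = [float(i % 10) for i in range(40)]   # held fixed across replications
n = len(x)
xbar = sum(x) / n
d = [xi - xbar for xi in x]
sst_x = sum(di ** 2 for di in d)

def draw_slope():
    """Redraw homoskedastic, independent errors and return the OLS slope."""
    y = [beta0 + beta1 * xi + random.gauss(0, sigma) for xi in x]
    return sum(di * yi for di, yi in zip(d, y)) / sst_x

slopes = [draw_slope() for _ in range(5000)]
mc_var = statistics.pvariance(slopes)    # simulated variance of the slope
theory_var = sigma ** 2 / sst_x          # sigma^2 / SST_x

print(round(mc_var, 4), round(theory_var, 4))
```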


Equations (2.57) and (2.58) are the “standard” formulas for simple regression analysis, which 
are invalid in the presence of heteroskedasticity. This will be important when we turn to confidence 
intervals and hypothesis testing in multiple regression analysis. 

For most purposes, we are interested in $\mathrm{Var}(\hat{\beta}_1)$. It is easy to summarize how this variance
depends on the error variance, $\sigma^2$, and the total variation in $\{x_1, x_2, \ldots, x_n\}$, $SST_x$. First, the larger the
error variance, the larger is $\mathrm{Var}(\hat{\beta}_1)$. This makes sense because more variation in the unobservables
affecting y makes it more difficult to precisely estimate $\beta_1$. On the other hand, more variability in the
independent variable is preferred: as the variability in the $x_i$ increases, the variance of $\hat{\beta}_1$ decreases.
This also makes intuitive sense because the more spread out is the sample of independent variables,
the easier it is to trace out the relationship between $E(y|x)$ and x; that is, the easier it is to estimate $\beta_1$.
If there is little variation in the $x_i$, then it can be hard to pinpoint how $E(y|x)$ varies with x. As the
sample size increases, so does the total variation in the $x_i$. Therefore, a larger sample size results in a
smaller variance for $\hat{\beta}_1$.

This analysis shows that, if we are interested in $\beta_1$ and we have a choice, then we should choose the
$x_i$ to be as spread out as possible. This is sometimes possible with experimental data, but rarely do we
have this luxury in the social sciences: usually, we must take the $x_i$ that we obtain via random sampling.
Sometimes, we have an opportunity to obtain larger sample sizes, although this can be costly.


GOING FURTHER 2.5
Show that, when estimating $\beta_0$, it is best to have $\bar{x} = 0$. What is $\mathrm{Var}(\hat{\beta}_0)$ in this case?
[Hint: For any sample of numbers, $\sum_{i=1}^{n} x_i^2 \geq \sum_{i=1}^{n} (x_i - \bar{x})^2$, with equality only if $\bar{x} = 0$.]


For the purposes of constructing confidence intervals and deriving test statistics, we will need to
work with the standard deviations of $\hat{\beta}_1$ and $\hat{\beta}_0$, $\mathrm{sd}(\hat{\beta}_1)$ and $\mathrm{sd}(\hat{\beta}_0)$. Recall that these are obtained by
taking the square roots of the variances in (2.57) and (2.58). In particular, $\mathrm{sd}(\hat{\beta}_1) = \sigma/\sqrt{SST_x}$, where
$\sigma$ is the square root of $\sigma^2$, and $\sqrt{SST_x}$ is the square root of $SST_x$.
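A short calculation illustrates how the spread of the $x_i$ and the sample size drive the standard deviation of the slope estimator, using the relationship $\mathrm{sd}(\hat{\beta}_1) = \sigma/\sqrt{SST_x}$. The three designs and the value $\sigma = 2$ below are hypothetical.

```python
import math

def sd_slope(x, sigma):
    """sd of the OLS slope, sigma / sqrt(SST_x), conditional on the x values."""
    xbar = sum(x) / len(x)
    sst_x = sum((xi - xbar) ** 2 for xi in x)
    return sigma / math.sqrt(sst_x)

sigma = 2.0                       # hypothetical error standard deviation
narrow = [4.0, 5.0, 6.0] * 10     # x bunched around 5
spread = [0.0, 5.0, 10.0] * 10    # same n and mean, more variation in x
doubled = spread * 2              # same design with twice the sample size

print(sd_slope(narrow, sigma))    # largest: little variation in x
print(sd_slope(spread, sigma))    # smaller: x is more spread out
print(sd_slope(doubled, sigma))   # smaller still: larger sample
```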


2-5c Estimating the Error Variance 


The formulas in (2.57) and (2.58) allow us to isolate the factors that contribute to $\mathrm{Var}(\hat{\beta}_1)$ and $\mathrm{Var}(\hat{\beta}_0)$.
But these formulas are unknown, except in the extremely rare case that $\sigma^2$ is known. Nevertheless, we
can use the data to estimate $\sigma^2$, which then allows us to estimate $\mathrm{Var}(\hat{\beta}_1)$ and $\mathrm{Var}(\hat{\beta}_0)$.

This is a good place to emphasize the difference between the errors (or disturbances) and the
residuals, as this distinction is crucial for constructing an estimator of $\sigma^2$. Equation (2.48) shows how
to write the population model in terms of a randomly sampled observation as $y_i = \beta_0 + \beta_1 x_i + u_i$,
where $u_i$ is the error for observation i. We can also express $y_i$ in terms of its fitted value and residual as
in equation (2.32): $y_i = \hat{\beta}_0 + \hat{\beta}_1 x_i + \hat{u}_i$. Comparing these two equations, we see that the error shows
up in the equation containing the population parameters, $\beta_0$ and $\beta_1$. On the other hand, the residuals
show up in the estimated equation with $\hat{\beta}_0$ and $\hat{\beta}_1$. The errors are never observed, while the residuals
are computed from the data.

We can use equations (2.32) and (2.48) to write the residuals as a function of the errors: 


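$$\hat{u}_i = y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i = (\beta_0 + \beta_1 x_i + u_i) - \hat{\beta}_0 - \hat{\beta}_1 x_i,$$


or


$$\hat{u}_i = u_i - (\hat{\beta}_0 - \beta_0) - (\hat{\beta}_1 - \beta_1) x_i.$$ [2.59]


Although the expected value of $\hat{\beta}_0$ equals $\beta_0$, and similarly for $\hat{\beta}_1$, $\hat{u}_i$ is not the same as $u_i$. The
difference between them does have an expected value of zero.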

Now that we understand the difference between the errors and the residuals, we can return to
estimating $\sigma^2$. First, $\sigma^2 = E(u^2)$, so an unbiased “estimator” of $\sigma^2$ is $n^{-1} \sum_{i=1}^{n} u_i^2$. Unfortunately, this is not
a true estimator, because we do not observe the errors $u_i$. But, we do have estimates of the $u_i$, namely,
the OLS residuals $\hat{u}_i$. If we replace the errors with the OLS residuals, we have $n^{-1} \sum_{i=1}^{n} \hat{u}_i^2 = SSR/n$.
This is a true estimator, because it gives a computable rule for any sample of data on x and y. One
slight drawback to this estimator is that it turns out to be biased (although for large n the bias is
small). Because it is easy to compute an unbiased estimator, we use that instead.

The estimator SSR/n is biased essentially because it does not account for two restrictions that
must be satisfied by the OLS residuals. These restrictions are given by the two OLS first order
conditions:


$$\sum_{i=1}^{n} \hat{u}_i = 0, \quad \sum_{i=1}^{n} x_i \hat{u}_i = 0.$$ [2.60]


One way to view these restrictions is this: if we know $n - 2$ of the residuals, we can always get the
other two residuals by using the restrictions implied by the first order conditions in (2.60). Thus,
there are only $n - 2$ degrees of freedom in the OLS residuals, as opposed to n degrees of freedom
in the errors. It is important to understand that if we replace $\hat{u}_i$ with $u_i$ in (2.60), the restrictions would
no longer hold.
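The two restrictions are easy to verify on simulated data, where (unlike in practice) the errors are known. The sketch below uses hypothetical parameter values; it shows that the OLS residuals sum to zero and are orthogonal to the $x_i$ up to rounding, while the errors generally satisfy neither restriction.

```python
import random

random.seed(5)
beta0, beta1 = 1.0, 0.5                     # hypothetical population parameters
n = 30
x = [random.uniform(0, 10) for _ in range(n)]
u = [random.gauss(0, 2) for _ in range(n)]  # errors: unobserved in practice
y = [beta0 + beta1 * xi + ui for xi, ui in zip(x, u)]

xbar, ybar = sum(x) / n, sum(y) / n
b1 = sum((xi - xbar) * yi for xi, yi in zip(x, y)) / sum((xi - xbar) ** 2 for xi in x)
b0 = ybar - b1 * xbar
resid = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]

sum_resid = sum(resid)
sum_x_resid = sum(xi * ri for xi, ri in zip(x, resid))
sum_err = sum(u)
sum_x_err = sum(xi * ui for xi, ui in zip(x, u))

print(sum_resid, sum_x_resid)  # both zero up to rounding error
print(sum_err, sum_x_err)      # generally not zero
```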

The unbiased estimator of $\sigma^2$ that we will use makes a degrees of freedom adjustment:


$$\hat{\sigma}^2 = \frac{1}{(n - 2)} \sum_{i=1}^{n} \hat{u}_i^2 = SSR/(n - 2).$$ [2.61]


(This estimator is sometimes denoted as $S^2$, but we continue to use the convention of putting “hats”
over estimators.)
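A small simulation can show why the degrees of freedom adjustment matters. The sketch below assumes hypothetical values ($\sigma^2 = 4$, only $n = 10$ observations so the effect is visible) and averages both SSR/n and SSR/(n - 2) over many samples with the $x_i$ held fixed.

```python
import random

random.seed(6)
beta0, beta1, sigma2 = 1.0, 0.5, 4.0   # hypothetical population values
x = [float(i) for i in range(10)]      # small n makes the bias visible
n = len(x)
xbar = sum(x) / n
sst_x = sum((xi - xbar) ** 2 for xi in x)

def ssr_one_sample():
    """Draw one sample, run OLS, and return the sum of squared residuals."""
    y = [beta0 + beta1 * xi + random.gauss(0, sigma2 ** 0.5) for xi in x]
    ybar = sum(y) / n
    b1 = sum((xi - xbar) * yi for xi, yi in zip(x, y)) / sst_x
    b0 = ybar - b1 * xbar
    return sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))

ssrs = [ssr_one_sample() for _ in range(5000)]
mean_biased = sum(s / n for s in ssrs) / len(ssrs)          # average of SSR/n
mean_unbiased = sum(s / (n - 2) for s in ssrs) / len(ssrs)  # average of SSR/(n - 2)

print(round(mean_biased, 2), round(mean_unbiased, 2))
```

With these values, the average of SSR/n falls short of 4 by roughly the factor (n - 2)/n, while the average of SSR/(n - 2) is close to 4.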


THEOREM 2.3 UNBIASED ESTIMATION OF $\sigma^2$
Under Assumptions SLR.1 through SLR.5,


$$E(\hat{\sigma}^2) = \sigma^2.$$

PROOF: If we average equation (2.59) across all i and use the fact that the OLS residuals average out to zero,
we have $0 = \bar{u} - (\hat{\beta}_0 - \beta_0) - (\hat{\beta}_1 - \beta_1)\bar{x}$; subtracting this from (2.59) gives
$\hat{u}_i = (u_i - \bar{u}) - (\hat{\beta}_1 - \beta_1)(x_i - \bar{x})$. Therefore,
$\hat{u}_i^2 = (u_i - \bar{u})^2 + (\hat{\beta}_1 - \beta_1)^2 (x_i - \bar{x})^2 - 2(u_i - \bar{u})(\hat{\beta}_1 - \beta_1)(x_i - \bar{x})$. Summing
across all i gives
$\sum_{i=1}^{n} \hat{u}_i^2 = \sum_{i=1}^{n} (u_i - \bar{u})^2 + (\hat{\beta}_1 - \beta_1)^2 \sum_{i=1}^{n} (x_i - \bar{x})^2 - 2(\hat{\beta}_1 - \beta_1) \sum_{i=1}^{n} u_i (x_i - \bar{x})$.
Now, the expected value of the first term is $(n - 1)\sigma^2$, something that is shown in Math Refresher C. The
expected value of the second term is simply $\sigma^2$ because $E[(\hat{\beta}_1 - \beta_1)^2] = \mathrm{Var}(\hat{\beta}_1) = \sigma^2/SST_x$. Finally,
the third term can be written as $-2(\hat{\beta}_1 - \beta_1)^2 SST_x$; taking expectations gives $-2\sigma^2$. Putting these three
terms together gives $E(\sum_{i=1}^{n} \hat{u}_i^2) = (n - 1)\sigma^2 + \sigma^2 - 2\sigma^2 = (n - 2)\sigma^2$, so that $E[SSR/(n - 2)] = \sigma^2$.

If $\hat{\sigma}^2$ is plugged into the variance formulas (2.57) and (2.58), then we have unbiased estimators of
$\mathrm{Var}(\hat{\beta}_1)$ and $\mathrm{Var}(\hat{\beta}_0)$. Later on, we will need estimators of the standard deviations of $\hat{\beta}_1$ and $\hat{\beta}_0$, and
this requires estimating $\sigma$. The natural estimator of $\sigma$ is


$$\hat{\sigma} = \sqrt{\hat{\sigma}^2}$$ [2.62]


and is called the standard error of the regression (SER). (Other names for $\hat{\sigma}$ are the standard error
of the estimate and the root mean squared error, but we will not use these.) Although $\hat{\sigma}$ is not an
unbiased estimator of $\sigma$, we can show that it is a consistent estimator of $\sigma$ (see Math Refresher C), and it
will serve our purposes well.

The estimate $\hat{\sigma}$ is interesting because it is an estimate of the standard deviation in the unobservables
affecting y; equivalently, it estimates the standard deviation in y after the effect of x has been taken out.
Most regression packages report the value of $\hat{\sigma}$ along with the R-squared, intercept, slope, and other OLS
statistics (under one of the several names listed above). For now, our primary interest is in using $\hat{\sigma}$ to
estimate the standard deviations of $\hat{\beta}_0$ and $\hat{\beta}_1$. Because $\mathrm{sd}(\hat{\beta}_1) = \sigma/\sqrt{SST_x}$, the natural estimator
of $\mathrm{sd}(\hat{\beta}_1)$ is


$$\mathrm{se}(\hat{\beta}_1) = \hat{\sigma}/\sqrt{SST_x} = \hat{\sigma} \Big/ \Big(\sum_{i=1}^{n} (x_i - \bar{x})^2\Big)^{1/2};$$


this is called the standard error of $\hat{\beta}_1$. Note that $\mathrm{se}(\hat{\beta}_1)$ is viewed as a random variable when we think of
running OLS over different samples of y; this is true because $\hat{\sigma}$ varies with different samples. For a given
sample, $\mathrm{se}(\hat{\beta}_1)$ is a number, just as $\hat{\beta}_1$ is simply a number when we compute it from the given data.

Similarly, $\mathrm{se}(\hat{\beta}_0)$ is obtained from $\mathrm{sd}(\hat{\beta}_0)$ by replacing $\sigma$ with $\hat{\sigma}$. The standard error of any
estimate gives us an idea of how precise the estimator is. Standard errors play a central role throughout
this text; we will use them to construct test statistics and confidence intervals for every econometric
procedure we cover, starting in Chapter 4.
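The steps from residuals to a standard error can be traced on a tiny data set. The five observations below are hypothetical and serve only to show the order of the calculations: OLS estimates, SSR, the standard error of the regression $\hat{\sigma}$, and then $\mathrm{se}(\hat{\beta}_1)$.

```python
import math

x = [1.0, 2.0, 3.0, 4.0, 5.0]      # hypothetical data
y = [2.0, 4.1, 5.9, 8.2, 9.8]
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n
sst_x = sum((xi - xbar) ** 2 for xi in x)
b1 = sum((xi - xbar) * yi for xi, yi in zip(x, y)) / sst_x
b0 = ybar - b1 * xbar

ssr = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
sigma_hat = math.sqrt(ssr / (n - 2))   # standard error of the regression
se_b1 = sigma_hat / math.sqrt(sst_x)   # standard error of the slope

print(round(b1, 3), round(sigma_hat, 4), round(se_b1, 4))
```

Note the division by n - 2 = 3 rather than n = 5 when forming $\hat{\sigma}$, which is the degrees of freedom adjustment.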


2-6 Regression through the Origin and Regression on a Constant 


In rare cases, we wish to impose the restriction that, when x = 0, the expected value of y is zero. 
There are certain relationships for which this is reasonable. For example, if income (x) is zero, then 
income tax revenues (y) must also be zero. In addition, there are settings where a model that originally 
has a nonzero intercept is transformed into a model without an intercept. 

Formally, we now choose a slope estimator, which we call $\tilde{\beta}_1$, and a line of the form


$$\tilde{y} = \tilde{\beta}_1 x,$$ [2.63]


where the tildes over $\tilde{\beta}_1$ and $\tilde{y}$ are used to distinguish this problem from the much more common
problem of estimating an intercept along with a slope. Obtaining (2.63) is called regression through
the origin because the line (2.63) passes through the point $x = 0$, $\tilde{y} = 0$. To obtain the slope estimate
in (2.63), we still rely on the method of ordinary least squares, which in this case minimizes the sum
of squared residuals:


$$\sum_{i=1}^{n} (y_i - \tilde{\beta}_1 x_i)^2.$$ [2.64]


Using one-variable calculus, it can be shown that $\tilde{\beta}_1$ must solve the first order condition:


$$\sum_{i=1}^{n} x_i (y_i - \tilde{\beta}_1 x_i) = 0.$$ [2.65]


From this, we can solve for $\tilde{\beta}_1$:


$$\tilde{\beta}_1 = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2},$$ [2.66]


provided that not all the $x_i$ are zero, a case we rule out.

Note how $\tilde{\beta}_1$ compares with the slope estimate when we also estimate the intercept (rather than
set it equal to zero). These two estimates are the same if, and only if, $\bar{x} = 0$. [See equation (2.49) for
$\hat{\beta}_1$.] Obtaining an estimate of $\beta_1$ using regression through the origin is not done very often in applied
work, and for good reason: if the intercept $\beta_0 \neq 0$, then $\tilde{\beta}_1$ is a biased estimator of $\beta_1$. You will be
asked to prove this in Problem 8.
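The comparison between the two slope estimates can be seen on a toy example. The sketch below uses hypothetical data lying exactly on the line y = 2 + x, so the intercept is not zero; the function names are illustrative. It then repeats the calculation with x centered so that its sample average is zero.

```python
def slope_through_origin(x, y):
    """Equation (2.66): sum(x*y) / sum(x^2)."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi ** 2 for xi in x)

def slope_with_intercept(x, y):
    """Equation (2.49): the usual OLS slope when an intercept is estimated."""
    xbar = sum(x) / len(x)
    num = sum((xi - xbar) * yi for xi, yi in zip(x, y))
    return num / sum((xi - xbar) ** 2 for xi in x)

# Hypothetical data lying exactly on y = 2 + x (nonzero intercept).
x, y = [1.0, 2.0, 3.0], [3.0, 4.0, 5.0]
print(slope_with_intercept(x, y))   # recovers the slope of 1
print(slope_through_origin(x, y))   # 26/14, pushed away from 1

# Same line, but x is centered so that its sample average is zero.
xc, yc = [-1.0, 0.0, 1.0], [1.0, 2.0, 3.0]
print(slope_with_intercept(xc, yc), slope_through_origin(xc, yc))  # the two agree
```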

In cases where regression through the origin is deemed appropriate, one must be careful in
interpreting the R-squared that is typically reported with such regressions. Usually, unless stated otherwise,
the R-squared is obtained without removing the sample average of $\{y_i : i = 1, \ldots, n\}$ in obtaining
SST. In other words, the R-squared is computed as


$$1 - \frac{\sum_{i=1}^{n} (y_i - \tilde{\beta}_1 x_i)^2}{\sum_{i=1}^{n} y_i^2}.$$ [2.67]


The numerator here makes sense because it is the sum of squared residuals, but the denominator
acts as if we know the average value of y in the population is zero. One reason this version of the
R-squared is used is that if we use the usual total sum of squares, that is, we compute R-squared as


$$1 - \frac{\sum_{i=1}^{n} (y_i - \tilde{\beta}_1 x_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2},$$ [2.68]


it can actually be negative. If expression (2.68) is negative then it means that using the sample average
$\bar{y}$ to predict $y_i$ provides a better fit than using $x_i$ in a regression through the origin. Therefore, (2.68) is
actually more attractive than equation (2.67) because equation (2.68) tells us whether using x is better
than ignoring x altogether.
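The difference between the two R-squared measures shows up clearly when y is nearly constant and unrelated to x. The sketch below uses hypothetical data of that kind; the function name is illustrative.

```python
def r2_through_origin(x, y):
    """Return the R-squareds in (2.67) and (2.68) for regression through the origin."""
    b1 = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi ** 2 for xi in x)
    ssr = sum((yi - b1 * xi) ** 2 for xi, yi in zip(x, y))
    ybar = sum(y) / len(y)
    r2_uncentered = 1 - ssr / sum(yi ** 2 for yi in y)          # equation (2.67)
    r2_centered = 1 - ssr / sum((yi - ybar) ** 2 for yi in y)   # equation (2.68)
    return r2_uncentered, r2_centered

# Hypothetical data: y is nearly constant at 5 and unrelated to x.
x = [1.0, 2.0, 3.0]
y = [5.0, 5.1, 4.9]
r2_u, r2_c = r2_through_origin(x, y)

print(round(r2_u, 3))   # looks respectable
print(round(r2_c, 1))   # strongly negative: the sample mean fits far better
```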

This discussion about regression through the origin, and different ways to measure goodness-
of-fit, prompts another question: what happens if we only regress on a constant? That is, what if we
set the slope to zero (which means we need not even have an x) and estimate an intercept only? The
answer is simple: the intercept is $\bar{y}$. This fact is usually shown in basic statistics, where it is shown
that the constant that produces the smallest sum of squared deviations is always the sample average.
In this light, equation (2.68) can be seen as comparing regression on x through the origin with
regression only on a constant.
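For completeness, the calculus argument behind this fact can be sketched in one line. Here b is a generic candidate value for the intercept, a symbol introduced only for this sketch; the first order condition pins down the sample average, and the second derivative, 2n, is positive, so the solution is a minimum.

```latex
\min_{b} \sum_{i=1}^{n} (y_i - b)^2
\quad\Longrightarrow\quad
-2 \sum_{i=1}^{n} (y_i - b) = 0
\quad\Longrightarrow\quad
b = \frac{1}{n} \sum_{i=1}^{n} y_i = \bar{y}.
```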


2-7 Regression on a Binary Explanatory Variable 


Our discussion so far has centered on the case where the explanatory variable, x, has quantitative 
meanings. A few examples include years of schooling, return on equity for a firm, and the percent- 
age of students at a school eligible for the federal free lunch program. We know how to interpret the 
slope coefficient in each case. We also discussed interpretation of the slope coefficient when we use 
the logarithmic transformations of the explained variable, the explanatory variable, or both. 

Simple regression can also be applied to the case where x is a binary variable, often called a 
dummy variable in the context of regression analysis. As the name “binary variable” suggests, x 
takes on only two values, zero and one. These two values are used to put each unit in the population 
into one of two groups represented by x = 0 and x = 1. For example, we can use a binary variable to 
describe whether a worker participates in a job training program. In the spirit of giving our variables 
descriptive names, we might use train to indicate participation: train = 1 means a person participates; 
train = 0 means the person does not. Given a data set, we add an i subscript, as usual, so $train_i$
indicates job training status for a randomly drawn person i.

If we have a dependent or response variable, y, what does it mean to have a simple regression
equation when x is binary? Consider again the equation


$$y = \beta_0 + \beta_1 x + u$$


but where now x is a binary variable. If we impose the zero conditional mean assumption SLR.4 then
we obtain


$$E(y|x) = \beta_0 + \beta_1 x + E(u|x) = \beta_0 + \beta_1 x,$$ [2.69]


just as in equation (2.8). The only difference now is that x can take on only two values. By plugging
the values zero and one into (2.69), it is easily seen that


$$E(y|x = 0) = \beta_0$$ [2.70]
$$E(y|x = 1) = \beta_0 + \beta_1.$$ [2.71]

It follows immediately that

$$\beta_1 = E(y|x = 1) - E(y|x = 0).$$ [2.72]


In other words, $\beta_1$ is the difference in the average value of y over the subpopulations with x = 1 and
x = 0. As with all simple regression analyses, this difference can be descriptive or, in a case discussed
in the next subsection, $\beta_1$ can be a causal effect of an intervention or a program.

As an example, suppose that every worker in an hourly wage industry is put into one of two
racial categories: white (or Caucasian) and nonwhite. (Clearly this is a very crude way to categorize
race, but it has been used in some contexts.) Define the variable white = 1 if a person is classified as
Caucasian and zero otherwise. Let wage denote hourly wage. Then


$$\beta_1 = E(wage|white = 1) - E(wage|white = 0)$$

is the difference in average hourly wages between white and nonwhite workers. Equivalently,

$$E(wage|white) = \beta_0 + \beta_1 white.$$


Notice that $\beta_1$ always has the interpretation that it is the difference in average wages between whites
and nonwhites. However, it does not necessarily measure wage discrimination because there are many
legitimate reasons wages can differ, and some of those (such as education levels) could differ, on
average, by race.

The mechanics of OLS do not change just because x is binary. Let $\{(x_i, y_i): i = 1, \ldots, n\}$ be the
sample of size n. The OLS intercept and slope estimators are always given by (2.16) and (2.19),
respectively. The residuals always have zero mean and are uncorrelated with the $x_i$ in the sample. The
definition of R-squared is unchanged. And so on. Nevertheless, because $x_i$ is binary, the OLS estimates have
a simple, sensible interpretation. Let $\bar{y}_0$ be the average of the $y_i$ with $x_i = 0$ and $\bar{y}_1$ the average when
$x_i = 1$. Problem 2.13 asks you to show that


$$\hat{\beta}_0 = \bar{y}_0$$ [2.73]
$$\hat{\beta}_1 = \bar{y}_1 - \bar{y}_0.$$ [2.74]

For example, in the wage/race example, if we run the regression

$wage_i$ on $white_i$, $i = 1, \ldots, n$,


then $\hat{\beta}_0 = \overline{wage}_0$, the average hourly wage for nonwhites, and $\hat{\beta}_1 = \overline{wage}_1 - \overline{wage}_0$, the difference
in average hourly wages between whites and nonwhites. Generally, equation (2.74) shows that the
“slope” in the regression is the difference in means, which is a standard estimator from basic statistics
when comparing two groups.
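The equivalence between OLS on a binary regressor and a comparison of group means can be confirmed on a few hypothetical observations. The helper `ols` below simply applies the usual intercept and slope formulas; the data are made up for illustration.

```python
def ols(x, y):
    """Return the OLS intercept and slope from the usual simple regression formulas."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b1 = sum((xi - xbar) * yi for xi, yi in zip(x, y)) / sum((xi - xbar) ** 2 for xi in x)
    return ybar - b1 * xbar, b1

# Hypothetical sample: x = 1 for one group, x = 0 for the other.
x = [0, 0, 1, 1, 1]
y = [10.0, 12.0, 15.0, 17.0, 19.0]

b0, b1 = ols(x, y)
y0 = [yi for xi, yi in zip(x, y) if xi == 0]
y1 = [yi for xi, yi in zip(x, y) if xi == 1]
ybar0, ybar1 = sum(y0) / len(y0), sum(y1) / len(y1)

print(b0, ybar0)          # intercept vs. mean of the x = 0 group
print(b1, ybar1 - ybar0)  # slope vs. difference in group means
```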

The statistical properties of OLS are also unchanged when x is binary. In fact, nowhere is this
ruled out in the statements of the assumptions. Assumption SLR.3 is satisfied provided we see some
zeros and some ones for $x_i$ in our sample. For example, in the wage/race example, we need to observe
some whites and some nonwhites in order to obtain $\hat{\beta}_1$.

As with any simple regression analysis, the main concern is the zero conditional mean assumption, 
SLR.4. In many cases, this condition will fail because x is systematically related to other factors that 
affect y, and those other factors are necessarily part of u. We alluded to this above in discussing differences 
in average hourly wage by race: education and workforce experience are two variables that affect hourly 
wage that could systematically differ by race. As another example, suppose we have data on SAT scores 
for students who did and did not take at least one SAT preparation course. Then x is a binary variable, say, 
course, and the outcome variable is the SAT score, sat. The decision to take the preparation course could 
be systematically related to other factors that are predictive of SAT scores, such as family income and 
parents’ education. A comparison of average SAT scores between the two groups is unlikely to uncover 
the causal effect of the preparation course. The framework covered in the next subsection allows us to 
determine the special circumstances under which simple regression can uncover a causal effect. 


2-7a Counterfactual Outcomes, Causality, and Policy Analysis 


Having introduced the notion of a binary explanatory variable, now is a good time to provide a formal 
framework for studying counterfactual or potential outcomes, as touched on briefly in Chapter 1. We 
are particularly interested in defining a causal effect or treatment effect. 

In the simplest case, we are interested in evaluating an intervention or policy that has only two 
states of the world: a unit is subjected to the intervention or not. In other words, those not subject 
to the intervention or new policy act as a control group and those subject to the intervention as the 
treatment group. Using the potential outcomes framework introduced in Chapter 1, for each unit $i$
in the population we assume there are outcomes in both states of the world, $y_i(0)$ and $y_i(1)$. We will
never observe any unit in both states of the world, but we imagine each unit in both states. For example,
in studying a job training program, a person does or does not participate. Then $y_i(0)$ is labor earnings if
person $i$ does not participate and $y_i(1)$ is labor earnings if $i$ does participate. These outcomes are well
defined before the program is even implemented. 

The causal effect, somewhat more commonly called the treatment effect, of the intervention for 
unit $i$ is simply


$te_i = y_i(1) - y_i(0)$, [2.75]


the difference between the two potential outcomes. There are a couple of noteworthy items about $te_i$.
First, it is not observed for any unit i because it depends on both counterfactuals. Second, it can be 
negative, zero, or positive. It could be that the causal effect is negative for some units and positive for 
others. 

We cannot hope to estimate $te_i$ for each unit $i$. Instead, the focus is typically on the average
treatment effect (ATE), also called the average causal effect (ACE). The ATE is simply the average
of the treatment effects across the entire population. (Sometimes for emphasis the ATE is called
the population average treatment effect.) We can write the ATE parameter as


$\tau_{ate} = E[te_i] = E[y_i(1) - y_i(0)] = E[y_i(1)] - E[y_i(0)]$, [2.76]


where the final expression uses linearity of the expected value. Sometimes, to emphasize the population
nature of $\tau_{ate}$, we write $\tau_{ate} = E[y(1) - y(0)]$, where $[y(0), y(1)]$ are the two random variables
representing the counterfactual outcomes in the population.
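
To make the definitions concrete, the sketch below computes unit-level treatment effects and their average for a hypothetical five-unit population in which, contrary to what is ever possible in practice, both potential outcomes are known. All numbers are invented for illustration.

```python
from statistics import mean

# Hypothetical potential outcomes for a five-unit population (illustrative numbers only).
y0 = [4.0, 6.0, 5.0, 7.0, 3.0]   # y_i(0): outcome without the intervention
y1 = [6.0, 5.0, 5.0, 10.0, 4.0]  # y_i(1): outcome with the intervention

te = [b - a for a, b in zip(y0, y1)]   # unit-level effects, as in (2.75)
tau_ate = mean(te)                     # average of the unit-level effects
tau_ate_alt = mean(y1) - mean(y0)      # same number, by linearity as in (2.76)
print(te, tau_ate, tau_ate_alt)
```

The unit-level effects here are positive for some units, zero for one, and negative for another, yet the average treatment effect is a single number, and it can be computed either as the average of the differences or as the difference of the averages.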

For each unit $i$, let $x_i$ be the program participation status, a binary variable. Then the observed
outcome, $y_i$, can be written as


$y_i = (1 - x_i) y_i(0) + x_i y_i(1)$, [2.77]


which is just shorthand for $y_i = y_i(0)$ if $x_i = 0$ and $y_i = y_i(1)$ if $x_i = 1$. This equation precisely
describes why, given a random sample from the population, we observe only one of $y_i(0)$ and $y_i(1)$.
To see how to estimate the average treatment effect, it is useful to rearrange (2.77):


$y_i = y_i(0) + [y_i(1) - y_i(0)] x_i$. [2.78]


Now impose a simple (and, usually, unrealistic) constant treatment effect. Namely, for all $i$,


$y_i(1) = \tau + y_i(0)$, [2.79]


or $\tau = y_i(1) - y_i(0)$. Plugging this into (2.78) gives


$y_i = y_i(0) + \tau x_i$.


Now write $y_i(0) = \alpha_0 + u_i(0)$ where, by definition, $\alpha_0 = E[y_i(0)]$ and $E[u_i(0)] = 0$. Plugging this
in gives


$y_i = \alpha_0 + \tau x_i + u_i(0)$. [2.80]


If we define $\beta_0 = \alpha_0$, $\beta_1 = \tau$, and $u_i = u_i(0)$, then the equation becomes exactly as in equation (2.48):


$y_i = \beta_0 + \beta_1 x_i + u_i$,


where $\beta_1 = \tau$ is the treatment (or causal) effect.
We can easily determine that the simple regression estimator, which we now know is the difference-in-means
estimator, is unbiased for the treatment effect, $\tau$. If $x_i$ is independent of $u_i(0)$ then


$E[u_i(0) | x_i] = 0$,


so that SLR.4 holds. We have already shown that SLR.1 holds in our derivation of (2.80). As usual, 
we assume random sampling (SLR.2), and SLR.3 holds provided we have some treated units and 
some control units, a basic requirement. It is pretty clear we cannot learn anything about the effect of 
the intervention if all sampled units are in the control group or all are in the treatment group. 

The assumption that $x_i$ is independent of $u_i(0)$ is the same as assuming that $x_i$ is independent of $y_i(0)$. This
assumption can be guaranteed only under random assignment, whereby units are assigned to the 
treatment and control groups using a randomization mechanism that ignores any features of the indi- 
vidual units. For example, in evaluating a job training program, random assignment occurs if a coin 
is flipped to determine whether a worker is in the control group or treatment group. (The coin can be 
biased in the sense that the probability of a head need not be 0.5.) Random assignment can be com- 
promised if units do not comply with their assignment. 
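
A small simulation can illustrate the constant-effect argument. The sketch below is only an illustration under stated assumptions: the values $\alpha_0 = 10$, $\tau = 2$, the normal distribution for $u_i(0)$, and the assignment probability 0.3 (a "biased coin") are all invented, and the function names are hypothetical.

```python
import random
from statistics import mean

def simulate_constant_effect(n, tau, p_treat, seed):
    """Draw n units under random assignment with a constant treatment effect."""
    rng = random.Random(seed)
    x, y = [], []
    for _ in range(n):
        y0 = 10.0 + rng.gauss(0.0, 3.0)            # y_i(0) = alpha_0 + u_i(0), alpha_0 = 10 (assumed)
        y1 = y0 + tau                              # constant effect, as in (2.79)
        xi = 1 if rng.random() < p_treat else 0    # coin flip that ignores y0 and y1
        x.append(xi)
        y.append((1 - xi) * y0 + xi * y1)          # observed outcome, as in (2.77)
    return x, y

def diff_in_means(x, y):
    """Average of y among treated units minus average of y among control units."""
    y1_bar = mean(yi for xi, yi in zip(x, y) if xi == 1)
    y0_bar = mean(yi for xi, yi in zip(x, y) if xi == 0)
    return y1_bar - y0_bar

x, y = simulate_constant_effect(n=20000, tau=2.0, p_treat=0.3, seed=0)
tau_hat = diff_in_means(x, y)
share_treated = mean(x)
print(f"share treated = {share_treated:.3f}, tau_hat = {tau_hat:.3f}")
```

With a large sample the difference in means should land close to the assumed $\tau$, even though only about 30 percent of units are treated. What matters is that assignment ignores the potential outcomes, not that the coin is fair.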

Random assignment is the hallmark of a randomized controlled trial (RCT), which has long been 
considered the gold standard for determining whether medical interventions have causal effects. RCTs generate
experimental data of the type discussed in Chapter 1. In recent years, RCTs have become
more popular in certain fields in economics, such as development economics and behavioral economics. 
Unfortunately, RCTs can be very expensive to implement, and in many cases randomizing subjects into 
control and treatment groups raises ethical issues. (For example, if giving low-income families access to 
free health care improves child health outcomes then randomizing some families into the control group 
means those children will have, on average, worse health outcomes than they could have otherwise.) 

Even though RCTs are not always feasible for answering particular questions in economics and 
other fields, it is a good idea to think about the experiment one would run if random assignment were 
possible. Working through the simple thought experiment typically ensures that one is asking a sen- 
sible question before gathering nonexperimental data. For example, if we want to study the effects 
of Internet access in rural areas on student performance, we might not have the resources (or ethical 
clearance) to randomly assign Internet access to some students and not others. Nevertheless, thinking 
about how such an experiment would be implemented sharpens our thinking about the potential out- 
comes framework and what we mean by the treatment effect. 

Our discussion of random assignment so far shows that, in the context of a constant treatment 
effect, the simple difference-in-means estimator, $\bar{y}_1 - \bar{y}_0$, is unbiased for $\tau$. We can easily relax the
constant treatment effect assumption. In general, the individual treatment effect can be written as 


$te_i = y_i(1) - y_i(0) = \tau_{ate} + [u_i(1) - u_i(0)]$, [2.81]


where $y_i(1) = \alpha_1 + u_i(1)$ and $\tau_{ate} = \alpha_1 - \alpha_0$. It is helpful to think of $\tau_{ate}$ as the average across the
entire population and $u_i(1) - u_i(0)$ as the deviation from the population average for unit $i$. Plugging
(2.81) into (2.78) gives


$y_i = \alpha_0 + \tau_{ate} x_i + u_i(0) + [u_i(1) - u_i(0)] x_i = \alpha_0 + \tau_{ate} x_i + u_i$, [2.82]


where the error term is now


$u_i = u_i(0) + [u_i(1) - u_i(0)] x_i$.


The random assignment assumption is now that $x_i$ is independent of $[u_i(0), u_i(1)]$. Even though $u_i$
depends on $x_i$, the zero conditional mean assumption holds:


$E(u_i | x_i) = E[u_i(0) | x_i] + E[u_i(1) - u_i(0) | x_i] x_i = 0 + 0 \cdot x_i = 0$.


We have again verified SLR.4, and so we conclude that the simple OLS estimator is unbiased for $\alpha_0$ and
$\tau_{ate}$, where $\hat{\tau}_{ate}$ is the difference-in-means estimator. [The error $u_i$ is not independent of $x_i$. In particular,
as shown in Problem 2.17, $\mathrm{Var}(u_i | x_i)$ differs across $x_i = 1$ and $x_i = 0$ if the variances of the potential
outcomes differ. But remember, Assumption SLR.5 is not used to show the OLS estimators are unbiased.]
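
The heterogeneous-effects case can also be explored by simulation. In the sketch below, the means, standard deviations, and normal distributions of the potential outcomes are assumptions chosen purely for illustration. Because $y_i = \alpha_0 + \tau_{ate} x_i + u_i$, the variance of $y_i$ within each group equals the variance of $u_i$ within that group.

```python
import random
from statistics import mean, variance

ALPHA_0, ALPHA_1 = 10.0, 12.5     # assumed E[y_i(0)] and E[y_i(1)]
TAU_ATE = ALPHA_1 - ALPHA_0       # average treatment effect, 2.5 here

rng = random.Random(7)
n = 20000
x, y, te = [], [], []
for _ in range(n):
    u0 = rng.gauss(0.0, 2.0)      # u_i(0), standard deviation 2 (assumed)
    u1 = rng.gauss(0.0, 5.0)      # u_i(1), standard deviation 5 (assumed)
    y0, y1 = ALPHA_0 + u0, ALPHA_1 + u1
    xi = 1 if rng.random() < 0.5 else 0      # assignment ignores (u0, u1)
    x.append(xi)
    y.append(y0 + (y1 - y0) * xi)            # observed outcome, as in (2.78)
    te.append(y1 - y0)                       # individual effect, never observed in practice

treated = [yi for xi, yi in zip(x, y) if xi == 1]
control = [yi for xi, yi in zip(x, y) if xi == 0]
tau_hat = mean(treated) - mean(control)
share_negative = mean(1 if t < 0 else 0 for t in te)
var_treated, var_control = variance(treated), variance(control)
print(f"tau_hat = {tau_hat:.3f}, share with negative effect = {share_negative:.2f}")
print(f"Var(y | x=1) = {var_treated:.1f}, Var(y | x=0) = {var_control:.1f}")
```

In this design a sizable share of units has a negative individual effect, yet the difference in means should still be close to the assumed average effect. The two conditional variances differ markedly, which is the failure of homoskedasticity noted in the bracketed remark, and it does not affect unbiasedness.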

The fact that the simple regression estimator is unbiased for $\tau_{ate}$ when the treatment
effects can vary arbitrarily across individual units is a very powerful result. However, it relies
heavily on random assignment. Starting in Chapter 3, we will see how multiple regression analysis 
can be used when pure random assignment does not hold. Chapter 20, available as an online supple- 
ment, contains an accessible survey of advanced methods for estimating treatment effects. 


Example 2.14 Evaluating a Job Training Program


The data in JTRAIN2 are from an old, experimental job training program, where men with poor labor 
market histories were assigned to control and treatment groups. This data set has been used widely in 
the program evaluation literature to compare estimates from nonexperimental programs. The training 
assignment indicator is train and here we are interested in the outcome re78, which is (real) earnings in 
1978 measured in thousands of dollars. Of the 445 men in the sample, 185 participated in the program 
in a period prior to 1978; the other 260 men comprise the control group. 

The simple regression gives 


$\widehat{re78} = 4.55 + 1.79\, train$
$n = 445, R^2 = 0.018.$


From the earlier discussion, we know that 1.79 is the difference in average re78 between the treated 
and control groups, so men who participated in the program earned an average of $1,790 more than 
the men who did not. This is an economically large effect, as the dollars are 1978 dollars. Plus, the 
average earnings for men who did not participate is $4,550; in percentage terms, the gain in average 
earnings is about 39.3%, which is large. (We would need to know the costs of the program to do a 
benefit-cost analysis, but the benefits are nontrivial.) 
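
The arithmetic behind the dollar and percentage figures is easy to reproduce from the two reported coefficients. The short sketch below uses only the estimates given above; the variable names are ours.

```python
b0_hat = 4.55   # estimated intercept: average re78 in the control group (thousands of 1978 dollars)
b1_hat = 1.79   # estimated slope: treated-minus-control difference in average re78

control_mean = b0_hat                      # predicted earnings when train = 0
treated_mean = b0_hat + b1_hat             # predicted earnings when train = 1
pct_gain = 100 * b1_hat / control_mean     # gain relative to the control-group average
print(f"treated mean = {treated_mean:.2f}, gain = {pct_gain:.1f}%")
```

The implied average for participants is about 6.34 thousand dollars, and the gain relative to the control-group average rounds to 39.3 percent.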

Remember that the fundamental issue in program evaluation is that we do not observe any of the 
units in both states of the world. In this example, we only observe one of the two earnings outcomes 
for each man. Nevertheless, random assignment into the treatment and control groups allows us to get
an unbiased estimator of the average treatment effect. 

Two final comments on this example. First, notice the very small R-squared: the training par- 
ticipation indicator explains less than two percent of the variation in re78 in the sample. We should 
not be surprised: many other factors, including education, experience, intelligence, age, motivation, 
and so on help determine labor market earnings. This is a good example to show how focusing on 
R-squared is not only unproductive, but it can be harmful. Beginning students sometimes think a 
small R-squared indicates “bias” in the OLS estimators. It does not. It simply means that the variance 
in the unobservables, Var(u), is large relative to Var(y). In this example, we know that Assumptions 
SLR.1 to SLR.4 hold because of random assignment. Rightfully, none of these assumptions mentions
how large R-squared must be; it is immaterial for the notion of unbiasedness. 
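
The point that a small R-squared is compatible with unbiasedness can be illustrated with a Monte Carlo sketch. Everything in the design below is an assumption made for illustration: a true effect of 1, an intercept of 3, normally distributed unobservables with standard deviation 5, samples of size 400, and 300 replications.

```python
import random
from statistics import mean

def ols_with_r2(x, y):
    """Return (intercept, slope, R-squared) from the OLS regression of y on x."""
    x_bar, y_bar = mean(x), mean(y)
    sst_x = sum((xi - x_bar) ** 2 for xi in x)
    slope = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sst_x
    intercept = y_bar - slope * x_bar
    ssr = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    sst_y = sum((yi - y_bar) ** 2 for yi in y)
    return intercept, slope, 1.0 - ssr / sst_y

TAU = 1.0        # assumed true treatment effect
SIGMA_U = 5.0    # assumed standard deviation of the unobservables
rng = random.Random(42)

slopes, r2s = [], []
for _ in range(300):                                            # 300 independent samples
    x = [1 if rng.random() < 0.5 else 0 for _ in range(400)]    # random assignment
    y = [3.0 + TAU * xi + rng.gauss(0.0, SIGMA_U) for xi in x]
    _, b1, r2 = ols_with_r2(x, y)
    slopes.append(b1)
    r2s.append(r2)

avg_slope, avg_r2 = mean(slopes), mean(r2s)
print(f"average slope = {avg_slope:.3f}, average R-squared = {avg_r2:.3f}")
```

Across replications the slope estimates should average out near the assumed effect, while the typical R-squared stays very small because the variance of the unobservables dominates the variance of the outcome.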

A second comment is that, while the estimated economic effect of $1,790 is large, we do not 
know whether this estimate is statistically significant. We will come to this topic in Chapter 4. 

Before ending this chapter, it is important to head off possible confusion about two different ways 
the word “random” has been used in this subsection. First, the notion of random sampling is the one 
introduced in Assumption SLR.2 (and also discussed in Math Refresher C). Random sampling means 
that the data we obtain are independent, identically distributed draws from the population distribution 
represented by the random variables $(x, y)$. It is important to understand that random sampling is a separate
concept from random assignment, which means that $x_i$ is determined independently of the
counterfactuals $[y_i(0), y_i(1)]$. In Example 2.14, we obtained a random sample from the relevant population, and
the assignment to treatment and control is randomized. But in other cases, random assignment will not 
hold even though we have random sampling. For example, it is relatively easy to draw a random sample 
from a large population of college-bound students and obtain outcomes on their SAT scores and whether 
they participated in an SAT preparation course. That does not mean that participation in a course is inde- 
pendent of the counterfactual outcomes. If we wanted to ensure independence between participation and 
the potential outcomes, we would randomly assign the students to take a course or not (and insist that 
students adhere to their assignments). If instead we obtain retrospective data (that is, we simply record
whether a student has taken a preparation course), then the independence assumption underlying random assignment is
unlikely to hold. But this has nothing to do with whether we obtained a random sample of students from
the population. The general point is that Assumptions SLR.2 and SLR.4 are very different. 
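
The distinction can be seen in a simulation that uses one and the same random sample of students but two different ways of determining course participation. The sketch below rests entirely on invented numbers: a constant course effect of 20 points, a standardized family income index that raises SAT scores, and a self-selection rule under which higher-income students are more likely to enroll.

```python
import random
from statistics import mean

TAU = 20.0  # assumed constant causal effect of the course on the SAT score

def draw_students(n, seed):
    """Random sample of (income_index, sat_without_course) pairs; all numbers are hypothetical."""
    rng = random.Random(seed)
    students = []
    for _ in range(n):
        income = rng.gauss(0.0, 1.0)                            # standardized family income
        sat0 = 1000.0 + 80.0 * income + rng.gauss(0.0, 100.0)   # potential outcome y_i(0)
        students.append((income, sat0))
    return students

def observed_sat(students, course):
    # Observed score: y_i(0) for non-takers, y_i(0) + TAU for takers.
    return [sat0 + TAU * c for (_, sat0), c in zip(students, course)]

def diff_in_means(course, sat):
    treated = [s for c, s in zip(course, sat) if c == 1]
    control = [s for c, s in zip(course, sat) if c == 0]
    return mean(treated) - mean(control)

students = draw_students(n=20000, seed=1)
rng = random.Random(2)

# (a) Retrospective data: higher-income students are more likely to enroll.
chosen = [1 if income + rng.gauss(0.0, 1.0) > 0 else 0 for income, _ in students]
# (b) Random assignment: a fair coin that ignores every student characteristic.
assigned = [1 if rng.random() < 0.5 else 0 for _ in students]

est_selected = diff_in_means(chosen, observed_sat(students, chosen))
est_randomized = diff_in_means(assigned, observed_sat(students, assigned))
print(f"self-selected: {est_selected:.1f}, randomized: {est_randomized:.1f}, truth: {TAU}")
```

Both estimates come from the same random sample, so SLR.2 holds in each case. Only under the coin-flip assignment should the difference in means be close to the assumed effect; under self-selection it also picks up the income difference between takers and non-takers, which is a failure of SLR.4.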


Summary 


We have introduced the simple linear regression model in this chapter, and we have covered its basic prop- 
erties. Given a random sample, the method of ordinary least squares is used to estimate the slope and 
intercept parameters in the population model. We have demonstrated the algebra of the OLS regression 
line, including computation of fitted values and residuals, and the obtaining of predicted changes in the 
dependent variable for a given change in the independent variable. In Section 2-4, we discussed two issues 
of practical importance: (1) the behavior of the OLS estimates when we change the units of measurement 
of the dependent variable or the independent variable and (2) the use of the natural log to allow for constant 
elasticity and constant semi-elasticity models. 

In Section 2-5, we showed that, under the four Assumptions SLR.1 through SLR.4, the OLS estima- 
tors are unbiased. The key assumption is that the error term u has zero mean, or average, given any value 
of the independent variable x. Unfortunately, there are reasons to think this is false in many social science 
applications of simple regression, where the omitted factors in u are often correlated with x. When we add 
the assumption that the variance of the error given x is constant, we get simple formulas for the sampling 
variances of the OLS estimators. As we saw, the variance of the slope estimator $\hat{\beta}_1$ increases as the error
variance increases, and it decreases when there is more sample variation in the independent variable. We
also derived an unbiased estimator for $\sigma^2 = \mathrm{Var}(u)$.
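
For reference, the standard homoskedastic variance formulas behind these statements can be written out explicitly. We take these to be the expressions the chapter labels (2.57) and (2.58); the conditioning on the sample values of $x$ and the notation $SST_x$ follow the usual convention and are restated here as assumptions rather than quoted from Section 2-5.

```latex
\[
\mathrm{Var}(\hat{\beta}_1 \mid x) = \frac{\sigma^2}{SST_x}, \qquad
\mathrm{Var}(\hat{\beta}_0 \mid x) = \frac{\sigma^2 \, n^{-1} \sum_{i=1}^n x_i^2}{SST_x}, \qquad
SST_x = \sum_{i=1}^n (x_i - \bar{x})^2 .
\]
```

The first expression makes both comparative statics visible: a larger $\sigma^2$ raises the variance of the slope estimator, and a larger $SST_x$ lowers it.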

In Section 2-6, we briefly discussed regression through the origin, where the slope estimator is 
obtained under the assumption that the intercept is zero. Sometimes, this is useful, but it appears infre- 
quently in applied work. 

In Section 2-7 we covered the important case where x is a binary variable, and showed that 
the OLS “slope” estimate is simply $\hat{\beta}_1 = \bar{y}_1 - \bar{y}_0$, the difference in the averages of $y_i$ between the
$x_i = 1$ and $x_i = 0$ subsamples. We also discussed how, in the context of causal inference, $\hat{\beta}_1$ is an
unbiased estimator of the average treatment effect under random assignment into the control and 
treatment groups. In Chapter 3 and beyond, we will study the case where the intervention or treatment 
is not randomized, but depends on observed and even unobserved factors. 

Much work is left to be done. For example, we still do not know how to test hypotheses about the population
parameters, $\beta_0$ and $\beta_1$. Thus, although we know that OLS is unbiased for the population parameters
under Assumptions SLR.1 through SLR.4, we have no way of drawing inferences about the population.
Other topics, such as the efficiency of OLS relative to other possible procedures, have also been omitted. 
The issues of confidence intervals, hypothesis testing, and efficiency are central to multiple regression 
analysis as well. Because the way we construct confidence intervals and test statistics is very similar for multi- 
ple regression—and because simple regression is a special case of multiple regression—our time is better spent 
moving on to multiple regression, which is much more widely applicable than simple regression. Our purpose 
in Chapter 2 was to get you thinking about the issues that arise in econometric analysis in a fairly simple setting. 


THE GAUSS-MARKOV ASSUMPTIONS FOR SIMPLE REGRESSION 


For convenience, we summarize the Gauss-Markov assumptions that we used in this chapter. It is important
to remember that only SLR.1 through SLR.4 are needed to show $\hat{\beta}_0$ and $\hat{\beta}_1$ are unbiased. We added the
homoskedasticity assumption, SLR.5, to obtain the usual OLS variance formulas (2.57) and (2.58). 


Assumption SLR.1 (Linear in Parameters) 
In the population model, the dependent variable, y, is related to the independent variable, x, and the error 
(or disturbance), u, as 


$y = \beta_0 + \beta_1 x + u$,
where $\beta_0$ and $\beta_1$ are the population intercept and slope parameters, respectively.


Assumption SLR.2 (Random Sampling) 
We have a random sample of size $n$, $\{(x_i, y_i): i = 1, 2, \ldots, n\}$, following the population model in
Assumption SLR.1.


Assumption SLR.3 (Sample Variation in the Explanatory Variable) 
The sample outcomes on $x$, namely, $\{x_i: i = 1, \ldots, n\}$, are not all the same value.


Assumption SLR.4 (Zero Conditional Mean) 


The error u has an expected value of zero given any value of the explanatory variable. In other words, 


$E(u|x) = 0$.


Assumption SLR.5 (Homoskedasticity) 
The error u has the same variance given any value of the explanatory variable. In other words, 


$\mathrm{Var}(u|x) = \sigma^2$.


Problems


1 Let kids denote the number of children ever born to a woman, and let educ denote years of education 
for the woman. A simple model relating fertility to years of education is 


$kids = \beta_0 + \beta_1 educ + u$,


where $u$ is the unobserved error.

(i) What kinds of factors are contained in $u$? Are these likely to be correlated with level of education?

(ii) Will a simple regression analysis uncover the ceteris paribus effect of education on fertility? 
Explain. 


2 In the simple linear regression model $y = \beta_0 + \beta_1 x + u$, suppose that $E(u) \neq 0$. Letting $\alpha_0 = E(u)$,
show that the model can always be rewritten with the same slope, but a new intercept and error, where 
the new error has a zero expected value. 


3 The following table contains the ACT scores and the GPA (grade point average) for eight college students.
Grade point average is based on a four-point scale and has been rounded to one digit after the
decimal.

Student   GPA   ACT
1         2.8   21
2         3.4   24
3         3.0   26
4         3.5   27
5         3.6   29
6         3.0   25
7         2.7   25
8         3.7   30


(i) Estimate the relationship between GPA and ACT using OLS; that is, obtain the intercept and 
slope estimates in the equation 


$\widehat{GPA} = \hat{\beta}_0 + \hat{\beta}_1 ACT$.


Comment on the direction of the relationship. Does the intercept have a useful interpretation 
here? Explain. How much higher is the GPA predicted to be if the ACT score is increased by 
five points? 

(ii) Compute the fitted values and residuals for each observation, and verify that the residuals 
(approximately) sum to zero. 

(iii) What is the predicted value of GPA when ACT = 20? 

(iv) How much of the variation in GPA for these eight students is explained by ACT? Explain. 


4 The data set BWGHT contains data on births to women in the United States. Two variables of interest 
are the dependent variable, infant birth weight in ounces (bwght), and an explanatory variable, average 
number of cigarettes the mother smoked per day during pregnancy (cigs). The following simple regres- 
sion was estimated using data on n = 1,388 births: 


$\widehat{bwght} = 119.77 - 0.514\, cigs$


(i) What is the predicted birth weight when $cigs = 0$? What about when $cigs = 20$ (one pack per
day)? Comment on the difference.

(ii) Does this simple regression necessarily capture a causal relationship between the child’s birth
weight and the mother’s smoking habits? Explain.

(iii) To predict a birth weight of 125 ounces, what would $cigs$ have to be? Comment.
(iv) The proportion of women in the sample who do not smoke while pregnant is about .85. Does 
this help reconcile your finding from part (iii)? 


5 In the linear consumption function


$\widehat{cons} = \hat{\beta}_0 + \hat{\beta}_1 inc$,


the (estimated) marginal propensity to consume (MPC) out of income is simply the slope, $\hat{\beta}_1$,
while the average propensity to consume (APC) is $\widehat{cons}/inc = \hat{\beta}_0/inc + \hat{\beta}_1$. Using observations
for 100 families on annual income and consumption (both measured in dollars), the following
equation is obtained:


$\widehat{cons} = -124.84 + 0.853\, inc$
$n = 100, R^2 = 0.692.$


(i) Interpret the intercept in this equation, and comment on its sign and magnitude.
(ii) What is the predicted consumption when family income is $30,000?
(iii) With inc on the x-axis, draw a graph of the estimated MPC and APC.


6 Using data from 1988 for houses sold in Andover, Massachusetts, from Kiel and McClain (1995),
the following equation relates housing price (price) to the distance from a recently built garbage
incinerator (dist):


$\widehat{\log(price)} = 9.40 + 0.312 \log(dist)$
$n = 135, R^2 = 0.162.$


(i) Interpret the coefficient on log(dist). Is the sign of this estimate what you expect it to be?

(ii) Do you think simple regression provides an unbiased estimator of the ceteris paribus
elasticity of price with respect to dist? (Think about the city’s decision on where to put
the incinerator.)

(iii) What other factors about a house affect its price? Might these be correlated with distance from 
the incinerator? 


7 Consider the savings function


$sav = \beta_0 + \beta_1 inc + u, \quad u = \sqrt{inc} \cdot e$,


where $e$ is a random variable with $E(e) = 0$ and $\mathrm{Var}(e) = \sigma_e^2$. Assume that $e$ is independent
of $inc$.

(i) Show that $E(u|inc) = 0$, so that the key zero conditional mean assumption (Assumption SLR.4)
is satisfied. [Hint: If $e$ is independent of $inc$, then $E(e|inc) = E(e)$.]

(ii) Show that $\mathrm{Var}(u|inc) = \sigma_e^2 inc$, so that the homoskedasticity Assumption SLR.5 is violated. In
particular, the variance of $sav$ increases with $inc$. [Hint: $\mathrm{Var}(e|inc) = \mathrm{Var}(e)$ if $e$ and $inc$ are
independent.]

(iii) Provide a discussion that supports the assumption that the variance of savings increases with 
family income. 


8 Consider the standard simple regression model $y = \beta_0 + \beta_1 x + u$ under the Gauss-Markov
Assumptions SLR.1 through SLR.5. The usual OLS estimators $\hat{\beta}_0$ and $\hat{\beta}_1$ are unbiased for their
respective population parameters. Let $\tilde{\beta}_1$ be the estimator of $\beta_1$ obtained by assuming the intercept is zero
(see Section 2-6).

(i) Find $E(\tilde{\beta}_1)$ in terms of the $x_i$, $\beta_0$, and $\beta_1$. Verify that $\tilde{\beta}_1$ is unbiased for $\beta_1$ when the population
intercept ($\beta_0$) is zero. Are there other cases where $\tilde{\beta}_1$ is unbiased?

(ii) Find the variance of $\tilde{\beta}_1$. (Hint: The variance does not depend on $\beta_0$.)

(iii) Show that $\mathrm{Var}(\tilde{\beta}_1) \leq \mathrm{Var}(\hat{\beta}_1)$. [Hint: For any sample of data, $\sum_{i=1}^n x_i^2 \geq \sum_{i=1}^n (x_i - \bar{x})^2$, with
strict inequality unless $\bar{x} = 0$.]

(iv) Comment on the tradeoff between bias and variance when choosing between $\hat{\beta}_1$ and $\tilde{\beta}_1$.


9 (i) Let $\hat{\beta}_0$ and $\hat{\beta}_1$ be the intercept and slope from the regression of $y_i$ on $x_i$ using $n$ observations. Let
$c_1$ and $c_2$, with $c_2 \neq 0$, be constants. Let $\tilde{\beta}_0$ and $\tilde{\beta}_1$ be the intercept and slope from the regression
of $c_1 y_i$ on $c_2 x_i$. Show that $\tilde{\beta}_1 = (c_1/c_2)\hat{\beta}_1$ and $\tilde{\beta}_0 = c_1 \hat{\beta}_0$, thereby verifying the claims on units of
measurement in Section 2-4. [Hint: To obtain $\tilde{\beta}_1$, plug the scaled versions of $x$ and $y$ into (2.19).
Then, use (2.17) for $\tilde{\beta}_0$, being sure to plug in the scaled $x$ and $y$ and the correct slope.]

(ii) Now, let $\tilde{\beta}_0$ and $\tilde{\beta}_1$ be from the regression of $(c_1 + y_i)$ on $(c_2 + x_i)$ (with no restriction on $c_1$ or $c_2$).
Show that $\tilde{\beta}_1 = \hat{\beta}_1$ and $\tilde{\beta}_0 = \hat{\beta}_0 + c_1 - c_2 \hat{\beta}_1$.

(iii) Now, let $\hat{\beta}_0$ and $\hat{\beta}_1$ be the OLS estimates from the regression of $\log(y_i)$ on $x_i$, where we must
assume $y_i > 0$ for all $i$. For $c_1 > 0$, let $\tilde{\beta}_0$ and $\tilde{\beta}_1$ be the intercept and slope from the regression
of $\log(c_1 y_i)$ on $x_i$. Show that $\tilde{\beta}_1 = \hat{\beta}_1$ and $\tilde{\beta}_0 = \log(c_1) + \hat{\beta}_0$.

(iv) Now, assuming that $x_i > 0$ for all $i$, let $\tilde{\beta}_0$ and $\tilde{\beta}_1$ be the intercept and slope from the regression
of $y_i$ on $\log(c_2 x_i)$. How do $\tilde{\beta}_0$ and $\tilde{\beta}_1$ compare with the intercept and slope from the regression
of $y_i$ on $\log(x_i)$?


10 Let $\hat{\beta}_0$ and $\hat{\beta}_1$ be the OLS intercept and slope estimators, respectively, and let $\bar{u}$ be the sample average
of the errors (not the residuals!).

(i) Show that $\hat{\beta}_1$ can be written as $\hat{\beta}_1 = \beta_1 + \sum_{i=1}^n w_i u_i$, where $w_i = d_i/SST_x$ and $d_i = x_i - \bar{x}$.

(ii) Use part (i), along with $\sum_{i=1}^n w_i = 0$, to show that $\hat{\beta}_1$ and $\bar{u}$ are uncorrelated. [Hint: You are
being asked to show that $E[(\hat{\beta}_1 - \beta_1) \cdot \bar{u}] = 0$.]

(iii) Show that $\hat{\beta}_0$ can be written as $\hat{\beta}_0 = \beta_0 + \bar{u} - (\hat{\beta}_1 - \beta_1)\bar{x}$.

(iv) Use parts (ii) and (iii) to show that $\mathrm{Var}(\hat{\beta}_0) = \sigma^2/n + \sigma^2 (\bar{x})^2/SST_x$.

(v) Do the algebra to simplify the expression in part (iv) to equation (2.58).
[Hint: $SST_x/n = n^{-1} \sum_{i=1}^n x_i^2 - (\bar{x})^2$.]


11 Suppose you are interested in estimating the effect of hours spent in an SAT preparation course
(hours) on total SAT score (sat). The population is all college-bound high school seniors for a
particular year.

(i) Suppose you are given a grant to run a controlled experiment. Explain how you would structure
the experiment in order to estimate the causal effect of hours on sat.

(ii) Consider the more realistic case where students choose how much time to spend in a preparation
course, and you can only randomly sample sat and hours from the population. Write the
population model as


$sat = \beta_0 + \beta_1 hours + u$


where, as usual in a model with an intercept, we can assume $E(u) = 0$. List at least two factors
contained in $u$. Are these likely to have positive or negative correlation with hours?
(iii) In the equation from part (ii), what should be the sign of $\beta_1$ if the preparation course is effective?
(iv) In the equation from part (ii), what is the interpretation of $\beta_0$?


12 Consider the problem described at the end of Section 2-6, running a regression and only estimating an
intercept.
(i) Given a sample $\{y_i: i = 1, 2, \ldots, n\}$, let $\tilde{\beta}_0$ be the solution to

$\min_{b_0} \sum_{i=1}^n (y_i - b_0)^2$.

Show that $\tilde{\beta}_0 = \bar{y}$, that is, the sample average minimizes the sum of squared residuals. (Hint:
You may use one-variable calculus or you can show the result directly by adding and subtracting
$\bar{y}$ inside the squared residual and then doing a little algebra.)
(ii) Define residuals $\tilde{u}_i = y_i - \bar{y}$. Argue that these residuals always sum to zero.


13 Let $y$ be any response variable and $x$ a binary explanatory variable. Let $\{(x_i, y_i): i = 1, \ldots, n\}$ be a
sample of size $n$. Let $n_0$ be the number of observations with $x_i = 0$ and $n_1$ the number of observations
with $x_i = 1$. Let $\bar{y}_0$ be the average of the $y_i$ with $x_i = 0$ and $\bar{y}_1$ the average of the $y_i$ with $x_i = 1$.

(i) Explain why we can write


$n_0 = \sum_{i=1}^n (1 - x_i), \quad n_1 = \sum_{i=1}^n x_i$.


Show that $\bar{x} = n_1/n$ and $(1 - \bar{x}) = n_0/n$. How do you interpret $\bar{x}$?
(ii) Argue that


$\bar{y}_0 = n_0^{-1} \sum_{i=1}^n (1 - x_i) y_i, \quad \bar{y}_1 = n_1^{-1} \sum_{i=1}^n x_i y_i$.


(iii) Show that the average of $y_i$ in the entire sample, $\bar{y}$, can be written as a weighted average:


$\bar{y} = (1 - \bar{x}) \bar{y}_0 + \bar{x} \bar{y}_1$.


[Hint: Write $y_i = (1 - x_i) y_i + x_i y_i$.]
(iv) Show that when $x_i$ is binary,


$n^{-1} \sum_{i=1}^n x_i^2 - (\bar{x})^2 = \bar{x}(1 - \bar{x})$.


[Hint: When $x_i$ is binary, $x_i^2 = x_i$.]
(v) Show that


$n^{-1} \sum_{i=1}^n x_i y_i - \bar{x}\bar{y} = \bar{x}(1 - \bar{x})(\bar{y}_1 - \bar{y}_0)$.


(vi) Use parts (iv) and (v) to obtain (2.74).
(vii) Derive equation (2.73).


14 In the context of Problem 2.13, suppose $y_i$ is also binary. For concreteness, $y_i$ indicates whether
worker $i$ is employed after a job training program, where $y_i = 1$ means has a job, $y_i = 0$ means does not
have a job. Here, $x_i$ indicates participation in the job training program. Argue that $\hat{\beta}_1$ is the difference in
employment rates between those who participated in the program and those who did not.


15 Consider the potential outcomes framework from Section 2-7a, where $y_i(0)$ and $y_i(1)$ are the potential
outcomes in each treatment state.
(i) Show that if we could observe $y_i(0)$ and $y_i(1)$ for all $i$ then an unbiased estimator of $\tau_{ate}$ would be


$n^{-1} \sum_{i=1}^n [y_i(1) - y_i(0)] = \bar{y}(1) - \bar{y}(0)$.


This is sometimes called the sample average treatment effect.
(ii) Explain why the observed sample averages, $\bar{y}_0$ and $\bar{y}_1$, are not the same as $\bar{y}(0)$ and $\bar{y}(1)$,
respectively, by writing $\bar{y}_0$ and $\bar{y}_1$ in terms of $y_i(0)$ and $y_i(1)$, respectively.


16 In the potential outcomes framework, suppose that program eligibility is randomly assigned but
participation cannot be enforced. To formally describe this situation, for each person $i$, $z_i$ is the eligibility
indicator and $x_i$ is the participation indicator. Randomized eligibility means $z_i$ is independent of
$[y_i(0), y_i(1)]$ but $x_i$ might not satisfy the independence assumption.

(i) Explain why the difference in means estimator is generally no longer unbiased.

(ii) In the context of a job training program, what kind of individual behavior would cause bias?


17 In the potential outcomes framework with heterogeneous (nonconstant) treatment effect, write the
error as


$u_i = (1 - x_i) u_i(0) + x_i u_i(1)$.


Let $\sigma_0^2 = \mathrm{Var}[u_i(0)]$ and $\sigma_1^2 = \mathrm{Var}[u_i(1)]$. Assume random assignment.
(i) Find $\mathrm{Var}(u_i|x_i)$.
(ii) When is $\mathrm{Var}(u_i|x_i)$ constant?


18 Let $x$ be a binary explanatory variable and suppose $P(x = 1) = p$ for $0 < p < 1$.

(i) If you draw a random sample of size $n$, find the probability (call it $\gamma_n$) that Assumption SLR.3
fails. [Hint: Find the probability of observing all zeros or all ones for the $x_i$.] Argue that $\gamma_n \to 0$
as $n \to \infty$.

(ii) If $p = 0.5$, compute the probability in part (i) for $n = 10$ and $n = 100$. Discuss.

(iii) Do the calculations from part (ii) with $p = 0.9$. How do your answers compare with part (ii)?


Computer Exercises 


C1 The data in 401K are a subset of data analyzed by Papke (1995) to study the relationship between
participation in a 401(k) pension plan and the generosity of the plan. The variable prate is the
percentage of eligible workers with an active account; this is the variable we would like to explain. The
measure of generosity is the plan match rate, mrate. This variable gives the average amount the firm
contributes to each worker’s plan for each $1 contribution by the worker. For example, if mrate = 0.50,
then a $1 contribution by the worker is matched by a 50¢ contribution by the firm.

(i) Find the average participation rate and the average match rate in the sample of plans.

(ii) Now, estimate the simple regression equation


$\widehat{prate} = \hat{\beta}_0 + \hat{\beta}_1 mrate$,


and report the results along with the sample size and R-squared.

(iii) Interpret the intercept in your equation. Interpret the coefficient on mrate.

(iv) Find the predicted prate when mrate = 3.5. Is this a reasonable prediction? Explain what is
happening here.

(v) How much of the variation in prate is explained by mrate? Is this a lot in your opinion?


C2 The data set in CEOSAL2 contains information on chief executive officers for U.S. corporations. The
variable salary is annual compensation, in thousands of dollars, and ceoten is prior number of years as
company CEO.

(i) Find the average salary and the average tenure in the sample.

(ii) How many CEOs are in their first year as CEO (that is, ceoten = 0)? What is the longest tenure
as a CEO?

(iii) Estimate the simple regression model


$\log(salary) = \beta_0 + \beta_1 ceoten + u$,


and report your results in the usual form. What is the (approximate) predicted percentage
increase in salary given one more year as a CEO?


C3 Use the data in SLEEP75 from Biddle and Hamermesh (1990) to study whether there is a tradeoff
between the time spent sleeping per week and the time spent in paid work. We could use either variable
as the dependent variable. For concreteness, estimate the model


$sleep = \beta_0 + \beta_1 totwrk + u$,


where sleep is minutes spent sleeping at night per week and totwrk is total minutes worked during
the week.

(i) Report your results in equation form along with the number of observations and $R^2$. What does
the intercept in this equation mean?

(ii) If totwrk increases by 2 hours, by how much is sleep estimated to fall? Do you find this to be a
large effect?


C4 


C5 


C6 


C7 


c8 
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Use the data in WAGE2 to estimate a simple regression explaining monthly salary (wage) in terms of 

IQ score (IQ). 

(i) Find the average salary and average IQ in the sample. What is the sample standard deviation 
of IQ? (IQ scores are standardized so that the average in the population is 100 with a standard 
deviation equal to 15.) 

(ii) Estimate a simple regression model where a one-point increase in IQ changes wage by a constant
dollar amount. Use this model to find the predicted increase in wage for an increase in
IQ of 15 points. Does IQ explain most of the variation in wage?

(iii) Now, estimate a model where each one-point increase in IQ has the same percentage effect on
wage. If IQ increases by 15 points, what is the approximate percentage increase in predicted wage?


C5 For the population of firms in the chemical industry, let rd denote annual expenditures on research and
development, and let sales denote annual sales (both are in millions of dollars).

(i) Write down a model (not an estimated equation) that implies a constant elasticity between
rd and sales. Which parameter is the elasticity?

(ii) Now, estimate the model using the data in RDCHEM. Write out the estimated equation in the 
usual form. What is the estimated elasticity of rd with respect to sales? Explain in words what 
this elasticity means. 


C6 We used the data in MEAP93 for Example 2.12. Now we want to explore the relationship between the
math pass rate (math10) and spending per student (expend).

(i) Do you think each additional dollar spent has the same effect on the pass rate, or does a
diminishing effect seem more appropriate? Explain.

(ii) In the population model


$$\mathit{math10} = \beta_0 + \beta_1 \log(\mathit{expend}) + u,$$


argue that $\beta_1/10$ is the percentage point change in math10 given a 10% increase in expend.

(iii) Use the data in MEAP93 to estimate the model from part (ii). Report the estimated equation in 
the usual way, including the sample size and R-squared. 

(iv) How big is the estimated spending effect? Namely, if spending increases by 10%, what is the 
estimated percentage point increase in math10? 

(v) One might worry that regression analysis can produce fitted values for math10 that are greater 
than 100. Why is this not much of a worry in this data set? 


C7 Use the data in CHARITY [obtained from Franses and Paap (2001)] to answer the following questions:

(i) What is the average gift in the sample of 4,268 people (in Dutch guilders)? What percentage of 
people gave no gift? 

(ii) What is the average mailings per year? What are the minimum and maximum values? 

(iii) Estimate the model 


$$\mathit{gift} = \beta_0 + \beta_1 \mathit{mailsyear} + u$$


by OLS and report the results in the usual way, including the sample size and R-squared. 
(iv) Interpret the slope coefficient. If each mailing costs one guilder, is the charity expected to make a 
net gain on each mailing? Does this mean the charity makes a net gain on every mailing? Explain. 
(v) What is the smallest predicted charitable contribution in the sample? Using this simple regres- 
sion analysis, can you ever predict zero for gift? 


C8 To complete this exercise you need a software package that allows you to generate data from the
uniform and normal distributions.

(i) Start by generating 500 observations on x (the explanatory variable) from the uniform
distribution with range [0,10]. (Most statistical packages have a command for the Uniform(0,1)
distribution; just multiply those observations by 10.) What are the sample mean and sample
standard deviation of the $x_i$?

(ii) Randomly generate 500 errors, $u_i$, from the Normal(0,36) distribution. (If you generate a
Normal(0,1), as is commonly available, simply multiply the outcomes by six.) Is the sample
average of the $u_i$ exactly zero? Why or why not? What is the sample standard deviation of the $u_i$?

(iii) Now generate the $y_i$ as


$$y_i = 1 + 2x_i + u_i = \beta_0 + \beta_1 x_i + u_i;$$


that is, the population intercept is one and the population slope is two. Use the data to run the
regression of $y_i$ on $x_i$. What are your estimates of the intercept and slope? Are they equal to the
population values in the above equation? Explain.

(iv) Obtain the OLS residuals, $\hat{u}_i$, and verify that equation (2.60) holds (subject to rounding error).

(v) Compute the same quantities in equation (2.60) but use the errors $u_i$ in place of the residuals.
Now what do you conclude?

(vi) Repeat parts (i), (ii), and (iii) with a new sample of data, starting with generating the $x_i$. Now
what do you obtain for $\hat{\beta}_0$ and $\hat{\beta}_1$? Why are these different from what you obtained in part (iii)?
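
The steps in this exercise can be sketched in Python with NumPy. This is a minimal sketch under stated assumptions, not a solution key: the helper name `simulate_slr`, the seed values, and the use of NumPy are illustrative choices, and the code assumes that equation (2.60) refers to the two OLS first order conditions, $\sum \hat{u}_i = 0$ and $\sum x_i \hat{u}_i = 0$.

```python
import numpy as np


def simulate_slr(n=500, seed=0):
    """Generate one sample from y = 1 + 2x + u and fit it by OLS (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    x = 10.0 * rng.uniform(0.0, 1.0, size=n)  # Uniform(0,1) scaled to [0,10]
    u = 6.0 * rng.standard_normal(n)          # Normal(0,1) times six: variance 36
    y = 1.0 + 2.0 * x + u                     # population intercept 1, slope 2

    # Simple regression formulas for the OLS slope and intercept.
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    resid = y - b0 - b1 * x

    return {
        "b0": b0,
        "b1": b1,
        "sum_resid": resid.sum(),          # zero up to rounding error
        "sum_x_resid": (x * resid).sum(),  # zero up to rounding error
        "sum_u": u.sum(),                  # generally not zero
        "sum_x_u": (x * u).sum(),          # generally not zero
    }


first = simulate_slr(seed=0)
second = simulate_slr(seed=1)  # a new sample gives different estimates
print(first["b0"], first["b1"], second["b0"], second["b1"])
```

Changing the seed plays the role of part (vi): each new sample produces different estimates, while the residual sums stay at zero up to rounding error by construction.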


C9 Use the data in COUNTYMURDERS to answer these questions. Use only the data for 1996.

(i) How many counties had zero murders in 1996? How many counties had at least one execution?
What is the largest number of executions?

(ii) Estimate the equation


$$\mathit{murders} = \beta_0 + \beta_1 \mathit{execs} + u$$


by OLS and report the results in the usual way, including sample size and R-squared.

(iii) Interpret the slope coefficient reported in part (ii). Does the estimated equation suggest a
deterrent effect of capital punishment?

(iv) What is the smallest number of murders that can be predicted by the equation? What is the 
residual for a county with zero executions and zero murders? 

(v) Explain why a simple regression analysis is not well suited for determining whether capital pun- 
ishment has a deterrent effect on murders. 


C10 The data set in CATHOLIC includes test score information on over 7,000 students in the United States
who were in eighth grade in 1988. The variables math12 and read12 are scores on twelfth grade
standardized math and reading tests, respectively.

(i) How many students are in the sample? Find the means and standard deviations of math12 and
read12.

(ii) Run the simple regression of math12 on read12 to obtain the OLS intercept and slope estimates.
Report the results in the form


$$\widehat{\mathit{math12}} = \hat{\beta}_0 + \hat{\beta}_1 \mathit{read12}$$
$$n = ?, \quad R^2 = ?$$


where you fill in the values for $\hat{\beta}_0$ and $\hat{\beta}_1$, and also replace the question marks.

(iii) Does the intercept reported in part (ii) have a meaningful interpretation? Explain.

(iv) Are you surprised by the $\hat{\beta}_1$ that you found? What about $R^2$?

(v) Suppose that you present your findings to a superintendent of a school district, and the 
superintendent says, “Your findings show that to improve math scores we just need to 
improve reading scores, so we should hire more reading tutors.” How would you respond 
to this comment? (Hint: If you instead run the regression of read12 on math12, what would 
you expect to find?) 


C11 Use the data in GPA1 to answer these questions. It is a sample of Michigan State University
undergraduates from the mid-1990s, and includes current college GPA, colGPA, and a binary variable
indicating whether the student owned a personal computer (PC).

(i) How many students are in the sample? Find the average and highest college GPAs.

(ii) How many students owned their own PC?

(iii) Estimate the simple regression equation


$$\mathit{colGPA} = \beta_0 + \beta_1 \mathit{PC} + u$$


and report your estimates for $\beta_0$ and $\beta_1$. Interpret these estimates, including a discussion of the
magnitudes.

(iv) What is the R-squared from the regression? What do you make of its magnitude? 

(v) Does your finding in part (iii) imply that owning a PC has a causal effect on colGPA? Explain. 


APPENDIX 2A 


Minimizing the Sum of Squared Residuals 


We show that the OLS estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ do minimize the sum of squared residuals, as asserted
in Section 2-2. Formally, the problem is to characterize the solutions $\hat{\beta}_0$ and $\hat{\beta}_1$ to the minimization
problem

$$\min_{b_0,\, b_1} \sum_{i=1}^{n} (y_i - b_0 - b_1 x_i)^2,$$

where $b_0$ and $b_1$ are the dummy arguments for the optimization problem; for simplicity, call this
function $Q(b_0, b_1)$. By a fundamental result from multivariable calculus (see Math Refresher A), a
necessary condition for $\hat{\beta}_0$ and $\hat{\beta}_1$ to solve the minimization problem is that the partial derivatives of
$Q(b_0, b_1)$ with respect to $b_0$ and $b_1$ must be zero when evaluated at $\hat{\beta}_0$, $\hat{\beta}_1$:
$\partial Q(\hat{\beta}_0, \hat{\beta}_1)/\partial b_0 = 0$ and $\partial Q(\hat{\beta}_0, \hat{\beta}_1)/\partial b_1 = 0$. Using the chain rule from calculus, these two
equations become

$$-2 \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0$$

$$-2 \sum_{i=1}^{n} x_i (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0.$$

These two equations are just (2.14) and (2.15) multiplied by $-2n$ and, therefore, are solved by the
same $\hat{\beta}_0$ and $\hat{\beta}_1$.

How do we know that we have actually minimized the sum of squared residuals? The first order
conditions are necessary but not sufficient conditions. One way to verify that we have minimized the
sum of squared residuals is to write, for any $b_0$ and $b_1$,

$$\begin{aligned}
Q(b_0, b_1) &= \sum_{i=1}^{n} \left[ y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i + (\hat{\beta}_0 - b_0) + (\hat{\beta}_1 - b_1) x_i \right]^2 \\
&= \sum_{i=1}^{n} \left[ \hat{u}_i + (\hat{\beta}_0 - b_0) + (\hat{\beta}_1 - b_1) x_i \right]^2 \\
&= \sum_{i=1}^{n} \hat{u}_i^2 + n(\hat{\beta}_0 - b_0)^2 + (\hat{\beta}_1 - b_1)^2 \sum_{i=1}^{n} x_i^2 + 2(\hat{\beta}_0 - b_0)(\hat{\beta}_1 - b_1) \sum_{i=1}^{n} x_i,
\end{aligned}$$

where we have used equations (2.30) and (2.31). The first term does not depend on $b_0$ or $b_1$, while
the sum of the last three terms can be written as

$$\sum_{i=1}^{n} \left[ (\hat{\beta}_0 - b_0) + (\hat{\beta}_1 - b_1) x_i \right]^2,$$

as can be verified by straightforward algebra. Because this is a sum of squared terms, the
smallest it can be is zero. Therefore, it is smallest when $b_0 = \hat{\beta}_0$ and $b_1 = \hat{\beta}_1$.
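
The decomposition above can be checked numerically. The following is a minimal sketch in Python with NumPy; the data are made up, and the function names `ssr` and `ols_simple` are hypothetical helpers introduced only for this illustration.

```python
import numpy as np


def ssr(b0, b1, x, y):
    """Sum of squared residuals Q(b0, b1)."""
    return float(np.sum((y - b0 - b1 * x) ** 2))


def ols_simple(x, y):
    """OLS intercept and slope from the usual simple regression formulas."""
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    return float(b0), float(b1)


# Made-up data, used only to illustrate the algebra.
rng = np.random.default_rng(42)
x = rng.uniform(0.0, 10.0, size=50)
y = 3.0 - 0.5 * x + rng.standard_normal(50)

beta0_hat, beta1_hat = ols_simple(x, y)
q_min = ssr(beta0_hat, beta1_hat, x, y)

# For arbitrary (b0, b1), Q equals the minimized SSR plus a sum of squares.
b0, b1 = 1.7, 0.3
extra = float(np.sum(((beta0_hat - b0) + (beta1_hat - b1) * x) ** 2))
print(ssr(b0, b1, x, y), q_min + extra)
```

The two printed numbers should agree up to rounding error, and because `extra` is a sum of squares, no choice of `b0` and `b1` can give a smaller value than `q_min`.
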
In Chapter 2, we learned how to use simple regression analysis to explain a dependent variable, y,
as a function of a single independent variable, x. The primary drawback in using simple regression
analysis for empirical work is that it is very difficult to draw ceteris paribus conclusions about how
x affects y: the key assumption, SLR.4 (that all other factors affecting y are uncorrelated with x), is
often unrealistic.

Multiple regression analysis is more amenable to ceteris paribus analysis because it allows us to 
explicitly control for many other factors that simultaneously affect the dependent variable. This is important 
both for testing economic theories and for evaluating policy effects when we must rely on nonexperimental 
data. Because multiple regression models can accommodate many explanatory variables that may be cor- 
related, we can hope to infer causality in cases where simple regression analysis would be misleading. 

Naturally, if we add more factors to our model that are useful for explaining y, then more of the 
variation in y can be explained. Thus, multiple regression analysis can be used to build better models 
for predicting the dependent variable. 

An additional advantage of multiple regression analysis is that it can incorporate fairly general 
functional form relationships. In the simple regression model, only one function of a single explana- 
tory variable can appear in the equation. As we will see, the multiple regression model allows for 
much more flexibility. 

Section 3-1 formally introduces the multiple regression model and further discusses the advantages
of multiple regression over simple regression. In Section 3-2, we demonstrate how to estimate
the parameters in the multiple regression model using the method of ordinary least squares.
In Sections 3-3, 3-4, and 3-5, we describe various statistical properties of the OLS estimators,
including unbiasedness and efficiency.

The multiple regression model is still the most widely used vehicle for empirical analysis in
economics and other social sciences. Likewise, the method of ordinary least squares is popularly used for
estimating the parameters of the multiple regression model.


3-1 Motivation for Multiple Regression 


3-1a The Model with Two Independent Variables 


We begin with some simple examples to show how multiple regression analysis can be used to solve 
problems that cannot be solved by simple regression. 

The first example is a simple variation of the wage equation introduced in Chapter 2 for obtaining 
the effect of education on hourly wage: 


$$\mathit{wage} = \beta_0 + \beta_1 \mathit{educ} + \beta_2 \mathit{exper} + u, \tag{3.1}$$


where exper is years of labor market experience. Thus, wage is determined by the two explanatory
or independent variables, education and experience, and by other unobserved factors, which are
contained in u. We are still primarily interested in the effect of educ on wage, holding fixed all other
factors affecting wage; that is, we are interested in the parameter $\beta_1$.

Compared with a simple regression analysis relating wage to educ, equation (3.1) effectively
takes exper out of the error term and puts it explicitly in the equation. Because exper appears in the equation,
its coefficient, $\beta_2$, measures the ceteris paribus effect of exper on wage, which is also of some interest.

Not surprisingly, just as with simple regression, we will have to make assumptions about how u in 
(3.1) is related to the independent variables, educ and exper. However, as we will see in Section 3-2, 
there is one thing of which we can be confident: because (3.1) contains experience explicitly, we will 
be able to measure the effect of education on wage, holding experience fixed. In a simple regression 
analysis—which puts exper in the error term—we would have to assume that experience is uncorre- 
lated with education, a tenuous assumption. 

As a second example, consider the problem of explaining the effect of per-student spending 
(expend) on the average standardized test score (avgscore) at the high school level. Suppose that the 
average test score depends on funding, average family income (avginc), and other unobserved factors: 


$$\mathit{avgscore} = \beta_0 + \beta_1 \mathit{expend} + \beta_2 \mathit{avginc} + u. \tag{3.2}$$


The coefficient of interest for policy purposes is $\beta_1$, the ceteris paribus effect of expend on avgscore.
By including avginc explicitly in the model, we are able to control for its effect on avgscore. This is
likely to be important because average family income tends to be correlated with per-student
spending: spending levels are often determined by both property and local income taxes. In simple
regression analysis, avginc would be included in the error term, which would likely be correlated with
expend, causing the OLS estimator of $\beta_1$ in the two-variable model to be biased.
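
The bias from leaving a correlated factor in the error term can be illustrated with a small simulation. This is a sketch under stated assumptions: the data are generated from a hypothetical population model with made-up parameter values (they are not estimates from any data set in the text), `x1` stands in for expend, `x2` stands in for avginc, and the helper `ols` is introduced only for this example.

```python
import numpy as np


def ols(y, *regressors):
    """OLS coefficients (intercept first) via least squares."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


rng = np.random.default_rng(0)
n = 5000
x1 = rng.standard_normal(n)             # plays the role of expend
x2 = 0.5 * x1 + rng.standard_normal(n)  # plays the role of avginc, correlated with x1
u = rng.standard_normal(n)
y = 1.0 + 2.0 * x1 + 3.0 * x2 + u       # hypothetical population model

long_coef = ols(y, x1, x2)  # controls for x2
short_coef = ols(y, x1)     # leaves x2 in the error term
print(long_coef[1], short_coef[1])
```

In this setup the coefficient on `x1` from the regression that includes `x2` should be close to the population value of 2, while the regression that omits `x2` should give a noticeably larger coefficient, because `x1` picks up part of the effect of the omitted variable.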

In the two previous similar examples, we have shown how observable factors other than the vari- 
able of primary interest [educ in equation (3.1) and expend in equation (3.2)] can be included in a 
regression model. Generally, we can write a model with two independent variables as 


$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u, \tag{3.3}$$
where


$\beta_0$ is the intercept.
$\beta_1$ measures the change in y with respect to $x_1$, holding other factors fixed.
$\beta_2$ measures the change in y with respect to $x_2$, holding other factors fixed.

Multiple regression analysis is also useful for generalizing functional relationships between
variables. As an example, suppose family consumption (cons) is a quadratic function of family
income (inc):


$$\mathit{cons} = \beta_0 + \beta_1 \mathit{inc} + \beta_2 \mathit{inc}^2 + u, \tag{3.4}$$


where u contains other factors affecting consumption. In this model, consumption depends on only 
one observed factor, income; so it might seem that it can be handled in a simple regression
framework. But the model falls outside simple regression because it contains two functions of income, inc
and $\mathit{inc}^2$ (and therefore three parameters, $\beta_0$, $\beta_1$, and $\beta_2$). Nevertheless, the consumption function is
easily written as a regression model with two independent variables by letting $x_1 = \mathit{inc}$ and $x_2 = \mathit{inc}^2$.

Mechanically, there will be no difference in using the method of ordinary least squares
(introduced in Section 3-2) to estimate equations as different as (3.1) and (3.4). Each equation can be
written as (3.3), which is all that matters for computation. There is, however, an important
difference in how one interprets the parameters. In equation (3.1), $\beta_1$ is the ceteris paribus effect of educ
on wage. The parameter $\beta_1$ has no such interpretation in (3.4). In other words, it makes no sense to
measure the effect of inc on cons while holding $\mathit{inc}^2$ fixed, because if inc changes, then so must $\mathit{inc}^2$!
Instead, the change in consumption with respect to the change in income (the marginal propensity to
consume) is approximated by


$$\frac{\Delta \mathit{cons}}{\Delta \mathit{inc}} \approx \beta_1 + 2\beta_2 \mathit{inc}.$$

See Math Refresher A for the calculus needed to derive this equation. In other words, the marginal
effect of income on consumption depends on $\beta_2$ as well as on $\beta_1$ and the level of income. This
example shows that, in any particular application, the definitions of the independent variables are crucial.
But for the theoretical development of multiple regression, we can be vague about such details. We
will study examples like this more completely in Chapter 6.
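
As a small illustration of the marginal propensity to consume formula, the sketch below evaluates $\beta_1 + 2\beta_2 \mathit{inc}$ at several income levels and compares it with a numerical slope of the quadratic. The parameter values are hypothetical and chosen only for illustration; the function names `cons` and `marginal_effect` are likewise assumptions of this example.

```python
def cons(inc, b0, b1, b2):
    """Quadratic consumption function (3.4) with the error term set to zero."""
    return b0 + b1 * inc + b2 * inc ** 2


def marginal_effect(inc, b1, b2):
    """Approximate change in cons per unit change in inc: b1 + 2*b2*inc."""
    return b1 + 2.0 * b2 * inc


# Hypothetical parameter values, chosen only for illustration.
b0, b1, b2 = 100.0, 0.8, -0.001

for inc in (50.0, 100.0, 200.0):
    h = 0.5
    # Central difference; exact for a quadratic up to rounding error.
    slope = (cons(inc + h, b0, b1, b2) - cons(inc - h, b0, b1, b2)) / (2 * h)
    print(inc, marginal_effect(inc, b1, b2), slope)
```

With a negative value for the coefficient on the squared term, the marginal effect falls as income rises, which is why neither parameter alone describes the effect of income on consumption.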

In the model with two independent variables, the key assumption about how u is related to
$x_1$ and $x_2$ is


$$E(u \mid x_1, x_2) = 0. \tag{3.5}$$


The interpretation of condition (3.5) is similar to the interpretation of Assumption SLR.4 for simple
regression analysis. It means that, for any values of $x_1$ and $x_2$ in the population, the average of the
unobserved factors is equal to zero. As with simple regression, the important part of the assumption
is that the expected value of u is the same for all combinations of $x_1$ and $x_2$; that this common value
is zero is no assumption at all as long as the intercept $\beta_0$ is included in the model (see Section 2-1).

How can we interpret the zero conditional mean assumption in the previous examples? In
equation (3.1), the assumption is $E(u \mid \mathit{educ}, \mathit{exper}) = 0$. This implies that other factors affecting wage are
not related on average to educ and exper. Therefore, if we think innate ability is part of u, then we will
need average ability levels to be the same across all combinations of education and experience in the
working population. This may or may not be true, but, as we will see in Section 3-3, this is the
question we need to ask in order to determine whether the method of ordinary least squares produces
unbiased estimators.


GOING FURTHER 3.1

A simple model to explain city murder rates (murdrate) in terms of the probability of
conviction (prbconv) and average sentence length (avgsen) is

$$\mathit{murdrate} = \beta_0 + \beta_1 \mathit{prbconv} + \beta_2 \mathit{avgsen} + u.$$

What are some factors contained in u? Do you think the key assumption (3.5) is
likely to hold?


The example measuring student performance [equation (3.2)] is similar to the wage equation.
The zero conditional mean assumption is $E(u \mid \mathit{expend}, \mathit{avginc}) = 0$, which means that other
factors affecting test scores (school or student characteristics) are, on average, unrelated to
per-student funding and average family income.

When applied to the quadratic consumption function in (3.4), the zero conditional mean
assumption has a slightly different interpretation. Written literally, equation (3.5) becomes
$E(u \mid \mathit{inc}, \mathit{inc}^2) = 0$. Because $\mathit{inc}^2$ is known when inc is known, including $\mathit{inc}^2$ in the expectation is
redundant: $E(u \mid \mathit{inc}, \mathit{inc}^2) = 0$ is the same as $E(u \mid \mathit{inc}) = 0$. Nothing is wrong with putting $\mathit{inc}^2$ along
with inc in the expectation when stating the assumption, but $E(u \mid \mathit{inc}) = 0$ is more concise.


3-1b The Model with k Independent Variables 


Once we are in the context of multiple regression, there is no need to stop with two independent vari- 
ables. Multiple regression analysis allows many observed factors to affect y. In the wage example, we 
might also include amount of job training, years of tenure with the current employer, measures of abil- 
ity, and even demographic variables like the number of siblings or mother’s education. In the school 
funding example, additional variables might include measures of teacher quality and school size. 

The general multiple linear regression (MLR) model (also called the multiple regression 
model) can be written in the population as 


$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \cdots + \beta_k x_k + u, \tag{3.6}$$
where


$\beta_0$ is the intercept.
$\beta_1$ is the parameter associated with $x_1$.
$\beta_2$ is the parameter associated with $x_2$, and so on.


Because there are k independent variables and an intercept, equation (3.6) contains k + 1 (unknown)
population parameters. For shorthand purposes, we will sometimes refer to the parameters other
than the intercept as slope parameters, even though this is not always literally what they are. [See
equation (3.4), where neither $\beta_1$ nor $\beta_2$ is itself a slope, but together they determine the slope of the
relationship between consumption and income.]

The terminology for multiple regression is similar to that for simple regression and is given in
Table 3.1. Just as in simple regression, the variable u is the error term or disturbance. It contains
factors other than $x_1, x_2, \ldots, x_k$ that affect y. No matter how many explanatory variables we include
in our model, there will always be factors we cannot include, and these are collectively contained in u.

When applying the general multiple regression model, we must know how to interpret the param- 
eters. We will get plenty of practice now and in subsequent chapters, but it is useful at this point to be 
reminded of some things we already know. Suppose that CEO salary (salary) is related to firm sales 
(sales) and CEO tenure (ceoten) with the firm by 


$$\log(\mathit{salary}) = \beta_0 + \beta_1 \log(\mathit{sales}) + \beta_2 \mathit{ceoten} + \beta_3 \mathit{ceoten}^2 + u. \tag{3.7}$$


This fits into the multiple regression model (with k = 3) by defining $y = \log(\mathit{salary})$, $x_1 = \log(\mathit{sales})$,
$x_2 = \mathit{ceoten}$, and $x_3 = \mathit{ceoten}^2$. As we know from Chapter 2, the parameter $\beta_1$ is the ceteris paribus
elasticity of salary with respect to sales. If $\beta_3 = 0$, then $100\beta_2$ is approximately the ceteris
paribus percentage increase in salary when ceoten increases by one year. When $\beta_3 \neq 0$, the
effect of ceoten on salary is more complicated. We will postpone a detailed treatment of general
models with quadratics until Chapter 6.


TABLE 3.1 Terminology for Multiple Regression

$y$                    $x_1, x_2, \ldots, x_k$
Dependent variable     Independent variables
Explained variable     Explanatory variables
Response variable      Control variables
Predicted variable     Predictor variables
Regressand             Regressors


Equation (3.7) provides an important reminder about multiple regression analysis. The term
“linear” in a multiple linear regression model means that equation (3.6) is linear in the parameters, $\beta_j$.
Equation (3.7) is an example of a multiple regression model that, while linear in the $\beta_j$, is a nonlinear
relationship between salary and the variables sales and ceoten. Many applications of multiple linear
regression involve nonlinear relationships among the underlying variables.
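
To make "linear in the parameters" concrete, the sketch below builds the three regressors of equation (3.7) from sales and ceoten and then applies ordinary least squares to the transformed columns. It is a minimal illustration under stated assumptions: the data are simulated, the parameter values in `true_beta` are made up (they are not estimates from CEOSAL2), and `fit_ceo_model` is a hypothetical helper name.

```python
import numpy as np


def fit_ceo_model(sales, ceoten, log_salary):
    """OLS for (3.7): regress log(salary) on log(sales), ceoten, ceoten^2."""
    X = np.column_stack([
        np.ones(len(sales)),  # intercept
        np.log(sales),        # x1 = log(sales)
        ceoten,               # x2 = ceoten
        ceoten ** 2,          # x3 = ceoten^2
    ])
    coef, *_ = np.linalg.lstsq(X, log_salary, rcond=None)
    return coef


# Simulated data with made-up parameter values (not estimates from CEOSAL2).
rng = np.random.default_rng(1)
n = 400
sales = np.exp(rng.normal(7.0, 1.0, size=n))
ceoten = rng.integers(0, 31, size=n).astype(float)
true_beta = np.array([4.5, 0.25, 0.02, -0.0004])
u = 0.3 * rng.standard_normal(n)
log_salary = (true_beta[0] + true_beta[1] * np.log(sales)
              + true_beta[2] * ceoten + true_beta[3] * ceoten ** 2 + u)

print(fit_ceo_model(sales, ceoten, log_salary))
```

The nonlinearity lives entirely in how the columns are constructed; once they are built, the estimation step is the same linear problem as for any other model of the form (3.6).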

The key assumption for the general multiple regression model is easy to state in terms of a
conditional expectation:


$$E(u \mid x_1, x_2, \ldots, x_k) = 0. \tag{3.8}$$


At a minimum, equation (3.8) requires that all factors in the unobserved error term be uncorrelated 
with the explanatory variables. It also means that we have correctly accounted for the functional rela- 
tionships between the explained and explanatory variables. Any problem that causes u to be correlated 
with any of the independent variables causes (3.8) to fail. In Section 3-3, we will show that assump- 
tion (3.8) implies that OLS is unbiased and will derive the bias that arises when a key variable has 
been omitted from the equation. In Chapters 15 and 16, we will study other reasons that might cause 
(3.8) to fail and show what can be done in cases where it does fail. 


3-2 Mechanics and Interpretation of Ordinary Least Squares 


We now summarize some computational and algebraic features of the method of ordinary least squares 
as it applies to a particular set of data. We also discuss how to interpret the estimated equation. 


3-2a Obtaining the OLS Estimates 


We first consider estimating the model with two independent variables. The estimated OLS equation 
is written in a form similar to the simple regression case: 


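$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2, \tag{3.9}$$
where


$\hat{\beta}_0$ = the estimate of $\beta_0$.
$\hat{\beta}_1$ = the estimate of $\beta_1$.
$\hat{\beta}_2$ = the estimate of $\beta_2$.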


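But how do we obtain $\hat{\beta}_0$, $\hat{\beta}_1$, and $\hat{\beta}_2$? The method of ordinary least squares chooses the
estimates to minimize the sum of squared residuals. That is, given n observations on y, $x_1$, and $x_2$,
$\{(x_{i1}, x_{i2}, y_i): i = 1, 2, \ldots, n\}$, the estimates $\hat{\beta}_0$, $\hat{\beta}_1$, and $\hat{\beta}_2$ are chosen simultaneously to make


$$\sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \hat{\beta}_2 x_{i2})^2 \tag{3.10}$$
as small as possible.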

To understand what OLS is doing, it is important to master the meaning of the indexing of the
independent variables in (3.10). The independent variables have two subscripts here, i followed by
either 1 or 2. The i subscript refers to the observation number. Thus, the sum in (3.10) is over all
i = 1 to n observations. The second index is simply a method of distinguishing between different
independent variables. In the example relating wage to educ and exper, $x_{i1} = \mathit{educ}_i$ is education
for person i in the sample and $x_{i2} = \mathit{exper}_i$ is experience for person i. The sum of squared
residuals in equation (3.10) is $\sum_{i=1}^{n} (\mathit{wage}_i - \hat{\beta}_0 - \hat{\beta}_1 \mathit{educ}_i - \hat{\beta}_2 \mathit{exper}_i)^2$. In what follows, the i subscript
is reserved for indexing the observation number. If we write $x_{ij}$, then this means the ith observation
on the jth independent variable. (Some authors prefer to switch the order of the observation
number and the variable number, so that $x_{1i}$ is observation i on variable one. But this is just a matter of
notational taste.)


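In the general case with k independent variables, we seek estimates, $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_k$, in the equation

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \cdots + \hat{\beta}_k x_k. \tag{3.11}$$

The OLS estimates, k + 1 of them, are chosen to minimize the sum of squared residuals:

$$\sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \cdots - \hat{\beta}_k x_{ik})^2. \tag{3.12}$$

This minimization problem can be solved using multivariable calculus (see Appendix 3A). This leads
to k + 1 linear equations in k + 1 unknowns $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_k$:

$$\begin{aligned}
\sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \cdots - \hat{\beta}_k x_{ik}) &= 0 \\
\sum_{i=1}^{n} x_{i1} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \cdots - \hat{\beta}_k x_{ik}) &= 0 \\
\sum_{i=1}^{n} x_{i2} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \cdots - \hat{\beta}_k x_{ik}) &= 0 \\
&\;\;\vdots \\
\sum_{i=1}^{n} x_{ik} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \cdots - \hat{\beta}_k x_{ik}) &= 0.
\end{aligned} \tag{3.13}$$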


These are often called the OLS first order conditions. As with the simple regression model in
Section 2-2, the OLS first order conditions can be obtained by the method of moments: under
assumption (3.8), $E(u) = 0$ and $E(x_j u) = 0$, where j = 1, 2, ..., k. The equations in (3.13) are the sample
counterparts of these population moments, although we have omitted the division by the sample
size n.

For even moderately sized n and k, solving the equations in (3.13) by hand calculations is tedious. 
Nevertheless, modern computers running standard statistics and econometrics software can solve 
these equations with large n and k very quickly. 
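
As an illustration of what such software does, the sketch below writes the equations in (3.13) in matrix form, $X'(y - Xb) = 0$, where $X$ has a column of ones followed by the k regressors, and solves them with NumPy. The data are made up, and `ols_normal_equations` is a hypothetical helper name. Solving the system directly is shown here for transparency; it is an assumption of this sketch rather than a description of any particular package, and many packages use numerically more stable routines.

```python
import numpy as np


def ols_normal_equations(X, y):
    """Solve the k+1 first order conditions X'(y - Xb) = 0 for b."""
    return np.linalg.solve(X.T @ X, X.T @ y)


# Made-up data with k = 3 regressors plus an intercept column.
rng = np.random.default_rng(3)
n, k = 200, 3
regressors = rng.normal(size=(n, k))
X = np.column_stack([np.ones(n), regressors])
y = X @ np.array([1.0, 0.5, -2.0, 0.25]) + rng.standard_normal(n)

b_hat = ols_normal_equations(X, y)
resid = y - X @ b_hat

# Each entry is one equation in (3.13): the residuals sum to zero, and so
# does each regressor times the residuals (up to rounding error).
print(X.T @ resid)
```

The call to `np.linalg.solve` fails or becomes unreliable when the columns of `X` are exactly or nearly linearly dependent, which is the situation in which the equations cannot be solved uniquely.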

There is only one slight caveat: we must assume that the equations in (3.13) can be solved
uniquely for the $\hat{\beta}_j$. For now, we just assume this, as it is usually the case in well-specified models.
In Section 3-3, we state the assumption needed for unique OLS estimates to exist (see Assumption
MLR.3).

As in simple regression analysis, equation (3.11) is called the OLS regression line or the sample
regression function (SRF). We will call $\hat{\beta}_0$ the OLS intercept estimate and $\hat{\beta}_1, \ldots, \hat{\beta}_k$ the OLS
slope estimates (corresponding to the independent variables $x_1, x_2, \ldots, x_k$).

To indicate that an OLS regression has been run, we will either write out equation (3.11) with y
and $x_1, \ldots, x_k$ replaced by their variable names (such as wage, educ, and exper), or we will say that
“we ran an OLS regression of y on $x_1, x_2, \ldots, x_k$” or that “we regressed y on $x_1, x_2, \ldots, x_k$.” These
are shorthand for saying that the method of ordinary least squares was used to obtain the OLS
equation (3.11). Unless explicitly stated otherwise, we always estimate an intercept along with the
slopes.


3-2b Interpreting the OLS Regression Equation 


More important than the details underlying the computation of the $\hat{\beta}_j$ is the interpretation of the
estimated equation. We begin with the case of two independent variables:


$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2. \tag{3.14}$$

The intercept $\hat{\beta}_0$ in equation (3.14) is the predicted value of y when $x_1 = 0$ and $x_2 = 0$. Sometimes,
setting $x_1$ and $x_2$ both equal to zero is an interesting scenario; in other cases, it will not make sense.
Nevertheless, the intercept is always needed to obtain a prediction of y from the OLS regression line,
as (3.14) makes clear.

The estimates $\hat{\beta}_1$ and $\hat{\beta}_2$ have partial effect, or ceteris paribus, interpretations. From equation
(3.14), we have


$$\Delta\hat{y} = \hat{\beta}_1 \Delta x_1 + \hat{\beta}_2 \Delta x_2,$$


so we can obtain the predicted change in y given the changes in $x_1$ and $x_2$. (Note how the intercept has
nothing to do with the changes in y.) In particular, when $x_2$ is held fixed, so that $\Delta x_2 = 0$, then


$$\Delta\hat{y} = \hat{\beta}_1 \Delta x_1,$$


holding $x_2$ fixed. The key point is that, by including $x_2$ in our model, we obtain a coefficient on $x_1$ with
a ceteris paribus interpretation. This is why multiple regression analysis is so useful. Similarly,


$$\Delta\hat{y} = \hat{\beta}_2 \Delta x_2,$$
holding $x_1$ fixed.


EXAMPLE 3.1 Determinants of College GPA


The variables in GPA1 include the college grade point average (colGPA), high school GPA (hsGPA), 
and achievement test score (ACT) for a sample of 141 students from a large university; both college 
and high school GPAs are on a four-point scale. We obtain the following OLS regression line to pre- 
dict college GPA from high school GPA and achievement test score: 


$$\widehat{\mathit{colGPA}} = 1.29 + .453\,\mathit{hsGPA} + .0094\,\mathit{ACT} \tag{3.15}$$
$$n = 141.$$

How do we interpret this equation? First, the intercept 1.29 is the predicted college GPA if hsGPA and 
ACT are both set as zero. Because no one who attends college has either a zero high school GPA or a 
zero on the achievement test, the intercept in this equation is not, by itself, meaningful. 

More interesting estimates are the slope coefficients on hsGPA and ACT. As expected, there is a positive 
partial relationship between colGPA and hsGPA: holding ACT fixed, another point on hsGPA is associated 
with .453 of a point on the college GPA, or almost half a point. In other words, if we choose two students, 
A and B, and these students have the same ACT score, but the high school GPA of Student A is one point 
higher than the high school GPA of Student B, then we predict Student A to have a college GPA .453 higher 
than that of Student B. (This says nothing about any two actual people, but it is our best prediction.) 

The sign on ACT implies that, while holding hsGPA fixed, a change in the ACT score of 10 points— 
a very large change, as the maximum ACT score is 36 and the average score in the sample is about 24 
with a standard deviation less than three—affects colGPA by less than one-tenth of a point. This is a 
small effect, and it suggests that, once high school GPA is accounted for, the ACT score is not a strong 
predictor of college GPA. (Naturally, there are many other factors that contribute to GPA, but here we 
focus on statistics available for high school students.) Later, after we discuss statistical inference, we 
will show that not only is the coefficient on ACT practically small, it is also statistically insignificant. 

If we focus on a simple regression analysis relating colGPA to ACT only, we obtain 


$$\widehat{\mathit{colGPA}} = 2.40 + .0271\,\mathit{ACT}$$
$$n = 141;$$


thus, the coefficient on ACT is almost three times as large as the estimate in (3.15). But this equation 
does not allow us to compare two people with the same high school GPA; it corresponds to a different 
experiment. We say more about the differences between multiple and simple regression later. 
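
The partial effect calculations in this example amount to simple arithmetic on the fitted equation (3.15). The short sketch below reproduces them in Python; the function name `predict_colgpa` and the particular GPA and ACT values plugged in are hypothetical choices for illustration, while the coefficients come from (3.15).

```python
def predict_colgpa(hsgpa, act):
    """Fitted value from equation (3.15)."""
    return 1.29 + 0.453 * hsgpa + 0.0094 * act


# Two hypothetical students with the same ACT score, one point apart in hsGPA.
diff_hsgpa = predict_colgpa(3.5, 24) - predict_colgpa(2.5, 24)

# Same hsGPA, ACT scores ten points apart.
diff_act = predict_colgpa(3.0, 30) - predict_colgpa(3.0, 20)

print(round(diff_hsgpa, 3), round(diff_act, 3))
```

The intercept cancels in both differences, and the first difference does not depend on which common ACT score is used, which is what the ceteris paribus interpretation of the slope coefficients means in practice.
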

The case with more than two independent variables is similar. The OLS regression line is

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \cdots + \hat{\beta}_k x_k. \tag{3.16}$$

Written in terms of changes,

$$\Delta\hat{y} = \hat{\beta}_1 \Delta x_1 + \hat{\beta}_2 \Delta x_2 + \cdots + \hat{\beta}_k \Delta x_k. \tag{3.17}$$


The coefficient on $x_1$ measures the change in $\hat{y}$ due to a one-unit increase in $x_1$, holding all other
independent variables fixed. That is,


$$\Delta\hat{y} = \hat{\beta}_1 \Delta x_1, \tag{3.18}$$


holding $x_2, x_3, \ldots, x_k$ fixed. Thus, we have controlled for the variables $x_2, x_3, \ldots, x_k$ when estimating
the effect of $x_1$ on y. The other coefficients have a similar interpretation.
The following is an example with three independent variables. 


EXAMPLE 3.2 Hourly Wage Equation


Using the 526 observations on workers in WAGE1, we include educ (years of education), exper (years 
of labor market experience), and tenure (years with the current employer) in an equation explaining 
log(wage). The estimated equation is 


$$\widehat{\log(wage)} = .284 + .092\,educ + .0041\,exper + .022\,tenure$$ [3.19]
n = 526.


As in the simple regression case, the coefficients have a percentage interpretation. The only difference 
here is that they also have a ceteris paribus interpretation. The coefficient .092 means that, holding 
exper and tenure fixed, another year of education is predicted to increase log(wage) by .092, which 
translates into an approximate 9.2% [100(.092)] increase in wage. Alternatively, if we take two people 
with the same levels of experience and job tenure, the coefficient on educ is the proportionate dif- 
ference in predicted wage when their education levels differ by one year. This measure of the return 
to education at least keeps two important productivity factors fixed; whether it is a good estimate of 
the ceteris paribus return to another year of education requires us to study the statistical properties of 
OLS (see Section 3-3). 


3-2c On the Meaning of “Holding Other Factors Fixed” in Multiple Regression


The partial effect interpretation of slope coefficients in multiple regression analysis can cause some 
confusion, so we provide a further discussion now. 

In Example 3.1, we observed that the coefficient on ACT measures the predicted difference in 
colGPA, holding hsGPA fixed. The power of multiple regression analysis is that it provides this ceteris 
paribus interpretation even though the data have not been collected in a ceteris paribus fashion. In giv- 
ing the coefficient on ACT a partial effect interpretation, it may seem that we actually went out and 
sampled people with the same high school GPA but possibly with different ACT scores. This is not the 
case. The data are a random sample from a large university: there were no restrictions placed on the 
sample values of hsGPA or ACT in obtaining the data. Rarely do we have the luxury of holding certain 
variables fixed in obtaining our sample. If we could collect a sample of individuals with the same high
school GPA, then we could perform a simple regression analysis relating colGPA to ACT. Multiple 
regression effectively allows us to mimic this situation without restricting the values of any indepen- 
dent variables. 
The power of multiple regression analysis is that it allows us to do in nonexperimental environments
what natural scientists are able to do in a controlled laboratory setting: keep other factors fixed.


3-2d Changing More Than One Independent Variable Simultaneously


Sometimes, we want to change more than one independent variable at the same time to find the result- 
ing effect on the dependent variable. This is easily done using equation (3.17). For example, in equa- 
tion (3.19), we can obtain the estimated effect on wage when an individual stays at the same firm for 
another year: exper (general workforce experience) and tenure both increase by one year. The total 
effect (holding educ fixed) is 


$$\Delta\widehat{\log(wage)} = .0041\,\Delta exper + .022\,\Delta tenure = .0041 + .022 = .0261,$$


or about 2.6%. Because exper and tenure each increase by one year, we just add the coefficients on 
exper and tenure and multiply by 100 to turn the effect into a percentage. 
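
The calculation of a joint change can be written as a short sketch. The coefficient values are those reported in (3.19); the helper name `predicted_log_change` is hypothetical and introduced here only for illustration.

```python
# Estimated slope coefficients reported in equation (3.19).
COEF = {"educ": 0.092, "exper": 0.0041, "tenure": 0.022}


def predicted_log_change(changes, coefficients=COEF):
    """Predicted change in log(wage), following equation (3.17).

    `changes` maps a regressor name to its change; regressors that are
    not listed are held fixed (their change is zero).
    """
    return sum(coefficients[name] * delta for name, delta in changes.items())


# One more year at the same firm: exper and tenure both rise by one year,
# while educ is held fixed.
d_log_wage = predicted_log_change({"exper": 1, "tenure": 1})
approx_pct = 100 * d_log_wage  # approximate percentage change in wage

print(f"change in log(wage): {d_log_wage:.4f} (about {approx_pct:.1f}%)")
```

Passing a different dictionary, such as `{"educ": 1}`, gives the partial effect of a single variable with the others held fixed.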


3-2e OLS Fitted Values and Residuals 


After obtaining the OLS regression line (3.11), we can obtain a fitted or predicted value for each
observation. For observation $i$, the fitted value is simply

$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \hat{\beta}_2 x_{i2} + \cdots + \hat{\beta}_k x_{ik},$$ [3.20]

which is just the predicted value obtained by plugging the values of the independent variables for
observation $i$ into equation (3.11). We should not forget about the intercept in obtaining the fitted
values; otherwise, the answer can be very misleading. As an example, if in (3.15), $hsGPA_i = 3.5$ and
$ACT_i = 24$, $\widehat{colGPA}_i = 1.29 + .453(3.5) + .0094(24) = 3.101$ (rounded to three places after the
decimal).

Normally, the actual value $y_i$ for any observation $i$ will not equal the predicted value, $\hat{y}_i$:
OLS minimizes the average squared prediction error, which says nothing about the prediction
error for any particular observation. The residual for observation $i$ is defined just as in the simple
regression case,

$$\hat{u}_i = y_i - \hat{y}_i.$$ [3.21]

There is a residual for each observation. If $\hat{u}_i > 0$, then $\hat{y}_i$ is below $y_i$, which means that, for this
observation, $y_i$ is underpredicted. If $\hat{u}_i < 0$, then $y_i < \hat{y}_i$, and $y_i$ is overpredicted.

The OLS fitted values and residuals have some important properties that are immediate extensions
from the single variable case:

1. The sample average of the residuals is zero and so $\bar{y} = \bar{\hat{y}}$.
2. The sample covariance between each independent variable and the OLS residuals is zero.
   Consequently, the sample covariance between the OLS fitted values and the OLS residuals is zero.
3. The point $(\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_k, \bar{y})$ is always on the OLS regression line:
   $\bar{y} = \hat{\beta}_0 + \hat{\beta}_1 \bar{x}_1 + \hat{\beta}_2 \bar{x}_2 + \cdots + \hat{\beta}_k \bar{x}_k$.

The first two properties are immediate consequences of the set of equations used to obtain the OLS
estimates. The first equation in (3.13) says that the sum of the residuals is zero. The remaining
equations are of the form $\sum_{i=1}^{n} x_{ij} \hat{u}_i = 0$, which implies that each independent variable has zero
sample covariance with $\hat{u}_i$. Property (3) follows immediately from property (1).

GOING FURTHER 3.2

In Example 3.1, the OLS fitted line explaining college GPA in terms of high school GPA
and ACT score is

$$\widehat{colGPA} = 1.29 + .453\,hsGPA + .0094\,ACT.$$

If the average high school GPA is about 3.4 and the average ACT score is about 24.2, what is the
average college GPA in the sample?
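
The three algebraic properties of OLS fitted values and residuals can be checked numerically. The following is a minimal sketch on simulated data; the data-generating process, the seed, and the helper name `ols_fit` are assumptions made only for illustration, and NumPy's least-squares routine stands in for whatever regression software is actually used.

```python
import numpy as np


def ols_fit(X, y):
    """OLS of y on X with an intercept; returns (coefficients, fitted, residuals)."""
    Z = np.column_stack([np.ones(len(y)), X])  # prepend the constant term
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    fitted = Z @ beta
    return beta, fitted, y - fitted


# Simulated sample (hypothetical numbers, two correlated regressors).
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)
y = 1.0 + 2.0 * x1 - 1.0 * x2 + rng.normal(size=n)
X = np.column_stack([x1, x2])

beta, fitted, resid = ols_fit(X, y)

# Property 1: residuals average to zero, so the fitted values average to y-bar.
mean_resid = resid.mean()

# Property 2: zero sample covariance between the residuals and each regressor,
# and therefore between the residuals and the fitted values.
cov_x_resid = [np.cov(X[:, j], resid)[0, 1] for j in range(X.shape[1])]
cov_fit_resid = np.cov(fitted, resid)[0, 1]

# Property 3: the point of sample averages lies on the regression line.
y_at_means = beta[0] + X.mean(axis=0) @ beta[1:]

print(mean_resid, cov_x_resid, cov_fit_resid, y_at_means - y.mean())
```

All printed quantities should be zero up to floating-point rounding, regardless of how the data were generated, because the properties are algebraic rather than statistical.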
3-2f A “Partialling Out” Interpretation of Multiple Regression


When applying OLS, we do not need to know explicit formulas for the $\hat{\beta}_j$ that solve the system of
equations in (3.13). Nevertheless, for certain derivations, we do need explicit formulas for the $\hat{\beta}_j$.
These formulas also shed further light on the workings of OLS.

Consider again the case with $k = 2$ independent variables, $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2$. For
concreteness, we focus on $\hat{\beta}_1$. One way to express $\hat{\beta}_1$ is

$$\hat{\beta}_1 = \left( \sum_{i=1}^{n} \hat{r}_{i1} y_i \right) \Big/ \left( \sum_{i=1}^{n} \hat{r}_{i1}^2 \right),$$ [3.22]

where the $\hat{r}_{i1}$ are the OLS residuals from a simple regression of $x_1$ on $x_2$ using the sample at hand. We
regress our first independent variable, $x_1$, on our second independent variable, $x_2$, and then obtain the
residuals ($y$ plays no role here). Equation (3.22) shows that we can then do a simple regression of $y$
on $\hat{r}_1$ to obtain $\hat{\beta}_1$. (Note that the residuals $\hat{r}_{i1}$ have a zero sample average, and so $\hat{\beta}_1$ is the usual
slope estimate from simple regression.)

The representation in equation (3.22) gives another demonstration of $\hat{\beta}_1$’s partial effect
interpretation. The residuals $\hat{r}_{i1}$ are the part of $x_{i1}$ that is uncorrelated with $x_{i2}$. Another way of saying
this is that $\hat{r}_{i1}$ is $x_{i1}$ after the effects of $x_{i2}$ have been partialled out, or netted out. Thus, $\hat{\beta}_1$
measures the sample relationship between $y$ and $x_1$ after $x_2$ has been partialled out.

In simple regression analysis, there is no partialling out of other variables because no other
variables are included in the regression. Computer Exercise C5 steps you through the partialling out
process using the wage data from Example 3.2. For practical purposes, the important thing is that $\hat{\beta}_1$ in
the equation $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2$ measures the change in $y$ given a one-unit increase in $x_1$,
holding $x_2$ fixed.

In the general model with $k$ explanatory variables, $\hat{\beta}_1$ can still be written as in equation (3.22), but
the residuals $\hat{r}_{i1}$ come from the regression of $x_1$ on $x_2, \ldots, x_k$. Thus, $\hat{\beta}_1$ measures the effect of $x_1$ on $y$
after $x_2, \ldots, x_k$ have been partialled or netted out. In econometrics, the general partialling out result is
usually called the Frisch-Waugh theorem. It has many uses in theoretical and applied econometrics.
We will see applications to time series regressions in Chapter 10.
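
The partialling out result in (3.22) can be illustrated with a two-step calculation. This is a sketch on simulated data, not the wage data used in the text; the variable names, coefficients, and seed are assumptions chosen for illustration.

```python
import numpy as np


def ols(Z, y):
    """OLS coefficients of y on the columns of Z (Z already contains a constant)."""
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return beta


# Simulated sample in which x1 and x2 are correlated by construction.
rng = np.random.default_rng(1)
n = 500
x2 = rng.normal(size=n)
x1 = 0.8 * x2 + rng.normal(size=n)
y = 1.0 + 0.5 * x1 + 2.0 * x2 + rng.normal(size=n)
ones = np.ones(n)

# Multiple regression of y on x1 and x2 (with an intercept).
beta_hat = ols(np.column_stack([ones, x1, x2]), y)

# Step 1: regress x1 on x2 (with an intercept) and keep the residuals r1.
Z2 = np.column_stack([ones, x2])
r1 = x1 - Z2 @ ols(Z2, x1)

# Step 2: apply formula (3.22), the slope from regressing y on r1.
beta1_partial = (r1 @ y) / (r1 @ r1)

print(beta_hat[1], beta1_partial)  # the two numbers agree up to rounding
```

Because `r1` has a zero sample average, the ratio in step 2 is exactly the simple regression slope of `y` on `r1`, which is why no intercept adjustment is needed there.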


3-2g Comparison of Simple and Multiple Regression Estimates 


Two special cases exist in which the simple regression of $y$ on $x_1$ will produce the same OLS estimate
on $x_1$ as the regression of $y$ on $x_1$ and $x_2$. To be more precise, write the simple regression of $y$ on $x_1$ as
$\tilde{y} = \tilde{\beta}_0 + \tilde{\beta}_1 x_1$, and write the multiple regression as $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2$. We know that the
simple regression coefficient $\tilde{\beta}_1$ does not usually equal the multiple regression coefficient $\hat{\beta}_1$. It turns
out there is a simple relationship between $\tilde{\beta}_1$ and $\hat{\beta}_1$, which allows for interesting comparisons between
simple and multiple regression:

$$\tilde{\beta}_1 = \hat{\beta}_1 + \hat{\beta}_2 \tilde{\delta}_1,$$ [3.23]

where $\tilde{\delta}_1$ is the slope coefficient from the simple regression of $x_{i2}$ on $x_{i1}$, $i = 1, \ldots, n$. This equation
shows how $\tilde{\beta}_1$ differs from the partial effect of $x_1$ on $\hat{y}$. The confounding term is the partial effect of
$x_2$ on $\hat{y}$ times the slope in the simple regression of $x_2$ on $x_1$. (See Section 3A-4 in the chapter appendix
for a more general verification.)

The relationship between $\tilde{\beta}_1$ and $\hat{\beta}_1$ also shows there are two distinct cases where they are equal:

1. The partial effect of $x_2$ on $\hat{y}$ is zero in the sample. That is, $\hat{\beta}_2 = 0$.
2. $x_1$ and $x_2$ are uncorrelated in the sample. That is, $\tilde{\delta}_1 = 0$.

Even though simple and multiple regression estimates are almost never identical, we can use
the above formula to characterize why they might be either very different or quite similar. For
example, if $\hat{\beta}_2$ is small, we might expect the multiple and simple regression estimates of $\beta_1$ to be similar.
In Example 3.1, the sample correlation between hsGPA and ACT is about .346, which is a nontrivial
correlation. But the coefficient on ACT is pretty small. Therefore, it is not surprising to find that the
simple regression of colGPA on hsGPA produces a slope estimate of .482, which is not much different
from the estimate .453 in (3.15).
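
Relationship (3.23) between the simple and multiple regression slopes is an exact identity in any sample, which a few lines of code can confirm. The sketch below uses simulated data with assumed coefficients; none of the numbers come from the GPA example.

```python
import numpy as np


def ols(Z, y):
    """OLS coefficients of y on the columns of Z (Z already contains a constant)."""
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return beta


# Simulated sample (hypothetical); x2 is correlated with x1 by construction.
rng = np.random.default_rng(2)
n = 300
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)
y = 2.0 + 1.0 * x1 + 1.5 * x2 + rng.normal(size=n)
ones = np.ones(n)

beta_tilde = ols(np.column_stack([ones, x1]), y)      # simple: y on x1
beta_hat = ols(np.column_stack([ones, x1, x2]), y)    # multiple: y on x1, x2
delta_tilde = ols(np.column_stack([ones, x1]), x2)    # auxiliary: x2 on x1

lhs = beta_tilde[1]                                   # simple regression slope
rhs = beta_hat[1] + beta_hat[2] * delta_tilde[1]      # right side of (3.23)

print(lhs, rhs)  # equal up to floating-point rounding
```

The gap between `beta_tilde[1]` and `beta_hat[1]` is exactly `beta_hat[2] * delta_tilde[1]`, so it shrinks when either factor is close to zero.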


Example 3.3 Participation in 401(k) Pension Plans


We use the data in 401K to estimate the effect of a plan’s match rate (mrate) on the participation rate 
(prate) in its 401(k) pension plan. The match rate is the amount the firm contributes to a worker’s 
fund for each dollar the worker contributes (up to some limit); thus, mrate = .75 means that the firm 
contributes 75¢ for each dollar contributed by the worker. The participation rate is the percentage of 
eligible workers having a 401(k) account. The variable age is the age of the 401(k) plan. There are
1,534 plans in the data set, the average prate is 87.36, the average mrate is .732, and the average age
is 13.2.

Regressing prate on mrate, age gives

$$\widehat{prate} = 80.12 + 5.52\,mrate + .243\,age$$
n = 1,534.


Thus, both mrate and age have the expected effects. What happens if we do not control for age? 
The estimated effect of age is not trivial, and so we might expect a large change in the estimated 
effect of mrate if age is dropped from the regression. However, the simple regression of prate on mrate 
yields $\widehat{prate} = 83.08 + 5.86\,mrate$. The simple regression estimate of the effect of mrate on prate is
clearly different from the multiple regression estimate, but the difference is not very big. (The sim- 
ple regression estimate is only about 6.2% larger than the multiple regression estimate.) This can be 
explained by the fact that the sample correlation between mrate and age is only .12. 
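
The arithmetic behind this comparison is easy to reproduce. The sketch below uses only the rounded estimates and sample averages reported in this example; because the inputs are rounded, the results match the text only approximately.

```python
# Estimates reported in the text for the 401K sample (rounded values).
b_mrate_multiple = 5.52   # slope on mrate when age is controlled for
b_mrate_simple = 5.86     # slope on mrate in the simple regression

# How much larger the simple regression estimate is, in percent.
pct_larger = 100 * (b_mrate_simple - b_mrate_multiple) / b_mrate_multiple

# The OLS line passes through the sample averages, so the fitted value at
# the average mrate (.732) and average age (13.2) should be close to the
# average prate (87.36).
prate_at_means = 80.12 + 5.52 * 0.732 + 0.243 * 13.2

print(f"{pct_larger:.1f}% larger; fitted prate at the means: {prate_at_means:.2f}")
```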


In the case with $k$ independent variables, the simple regression of $y$ on $x_1$ and the multiple
regression of $y$ on $x_1, x_2, \ldots, x_k$ produce an identical estimate of the coefficient on $x_1$ only if (1) the OLS
coefficients on $x_2$ through $x_k$ are all zero or (2) $x_1$ is uncorrelated with each of $x_2, \ldots, x_k$. Neither of
these is very likely in practice. But if the coefficients on $x_2$ through $x_k$ are small, or the sample
correlations between $x_1$ and the other independent variables are insubstantial, then the simple and
multiple regression estimates of the effect of $x_1$ on $y$ can be similar.


3-2h Goodness-of-Fit 


As with simple regression, we can define the total sum of squares (SST), the explained sum of 
squares (SSE), and the residual sum of squares or sum of squared residuals (SSR) as 


$$SST = \sum_{i=1}^{n} (y_i - \bar{y})^2$$ [3.24]

$$SSE = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$$ [3.25]

$$SSR = \sum_{i=1}^{n} \hat{u}_i^2.$$ [3.26]

Using the same argument as in the simple regression case, we can show that

$$SST = SSE + SSR.$$ [3.27]

In other words, the total variation in $\{y_i\}$ is the sum of the total variations in $\{\hat{y}_i\}$ and in $\{\hat{u}_i\}$.
Assuming that the total variation in $y$ is nonzero, as is the case unless $y_i$ is constant in the sample,
we can divide (3.27) by SST to get

$$SSR/SST + SSE/SST = 1.$$

Just as in the simple regression case, the R-squared is defined to be

$$R^2 = SSE/SST = 1 - SSR/SST,$$ [3.28]

and it is interpreted as the proportion of the sample variation in $y_i$ that is explained by the OLS
regression line. By definition, $R^2$ is a number between zero and one.

$R^2$ can also be shown to equal the squared correlation coefficient between the actual $y_i$ and the
fitted values $\hat{y}_i$. That is,

$$R^2 = \frac{\left( \sum_{i=1}^{n} (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}}) \right)^2}{\left( \sum_{i=1}^{n} (y_i - \bar{y})^2 \right) \left( \sum_{i=1}^{n} (\hat{y}_i - \bar{\hat{y}})^2 \right)}.$$ [3.29]

[We have put the average of the $\hat{y}_i$ in (3.29) to be true to the formula for a correlation coefficient;
we know that this average equals $\bar{y}$ because the sample average of the residuals is zero and
$y_i = \hat{y}_i + \hat{u}_i$.]
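
The decomposition in (3.27) and the two expressions for R-squared in (3.28) and (3.29) can be verified on any sample. The following sketch does so on simulated data; the function name `r_squared_pieces` and the data-generating process are assumptions made for illustration.

```python
import numpy as np


def r_squared_pieces(X, y):
    """Return SST, SSE, SSR and two versions of R-squared for OLS with an intercept."""
    Z = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    fitted = Z @ beta
    resid = y - fitted
    sst = np.sum((y - y.mean()) ** 2)        # total sum of squares, (3.24)
    sse = np.sum((fitted - y.mean()) ** 2)   # explained sum of squares, (3.25)
    ssr = np.sum(resid ** 2)                 # residual sum of squares, (3.26)
    corr = np.corrcoef(y, fitted)[0, 1]      # correlation of actual and fitted values
    return {"SST": sst, "SSE": sse, "SSR": ssr,
            "R2": 1.0 - ssr / sst, "corr2": corr ** 2}


# Simulated sample (hypothetical numbers).
rng = np.random.default_rng(3)
n = 141
X = rng.normal(size=(n, 2))
y = 1.0 + 0.5 * X[:, 0] + 0.1 * X[:, 1] + rng.normal(size=n)

pieces = r_squared_pieces(X, y)
print(pieces)
```

In the output, `SST` equals `SSE + SSR`, and `R2`, `SSE / SST`, and `corr2` coincide up to rounding.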


Example 3.4 Determinants of College GPA
From the grade point average regression that we did earlier, the equation with $R^2$ is

$$\widehat{colGPA} = 1.29 + .453\,hsGPA + .0094\,ACT$$
n = 141, $R^2$ = .176.


This means that hsGPA and ACT together explain about 17.6% of the variation in college GPA 
for this sample of students. This may not seem like a high percentage, but we must remember that 
there are many other factors—including family background, personality, quality of high school 
education, affinity for college—that contribute to a student’s college performance. If hsGPA and ACT 
explained almost all of the variation in colGPA, then performance in college would be preordained by 
high school performance! 


An important fact about $R^2$ is that it never decreases, and it usually increases, when another
independent variable is added to a regression and the same set of observations is used for both regressions.
This algebraic fact follows because, by definition, the sum of squared residuals never increases when
additional regressors are added to the model. For example, the last digit of one’s social security
number has nothing to do with one’s hourly wage, but adding this digit to a wage equation will increase
the $R^2$ (by a little, at least).
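
A small simulation illustrates that adding an irrelevant regressor cannot lower R-squared when the same observations are used. Everything in this sketch is hypothetical: the simulated `educ`, `log_wage`, and `last_digit` series merely stand in for the wage example and are not taken from WAGE1.

```python
import numpy as np


def r_squared(X, y):
    """R-squared, computed as 1 - SSR/SST, from OLS of y on X with an intercept."""
    Z = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)


rng = np.random.default_rng(4)
n = 526
educ = rng.integers(8, 19, size=n).astype(float)        # simulated years of schooling
log_wage = 0.5 + 0.09 * educ + rng.normal(scale=0.5, size=n)
last_digit = rng.integers(0, 10, size=n).astype(float)  # unrelated to log_wage by design

r2_small = r_squared(educ.reshape(-1, 1), log_wage)
r2_big = r_squared(np.column_stack([educ, last_digit]), log_wage)

print(f"without the digit: {r2_small:.4f}, with the digit: {r2_big:.4f}")
```

The second R-squared is at least as large as the first even though `last_digit` was generated independently of the outcome.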

An important caveat to the previous assertion about R-squared is that it assumes we do not have 
missing data on the explanatory variables. If two regressions use different sets of observations, then, 
in general, we cannot tell how the R-squareds will compare, even if one regression uses a subset of 
regressors. For example, suppose we have a full set of data on the variables $y$, $x_1$, and $x_2$, but for some
units in our sample data are missing on $x_3$. Then we cannot say that the R-squared from regressing $y$
on $x_1$, $x_2$ will be less than that from regressing $y$ on $x_1$, $x_2$, and $x_3$: it could go either way. Missing data
can be an important practical issue, and we will return to it in Chapter 9.

The fact that $R^2$ never decreases when any variable is added to a regression makes it a poor tool
for deciding whether one variable or several variables should be added to a model. The factor that
should determine whether an explanatory variable belongs in a model is whether the explanatory
variable has a nonzero partial effect on $y$ in the population. We will show how to test this
hypothesis in Chapter 4 when we cover statistical inference. We will also see that, when used properly, $R^2$
allows us to test a group of variables to see if it is important for explaining $y$. For now, we use it as a
goodness-of-fit measure for a given model.


Example 3.5 Explaining Arrest Records


CRIME1 contains data on arrests during the year 1986 and other information on 2,725 men born in
either 1960 or 1961 in California. Each man in the sample was arrested at least once prior to 1986.
The variable narr86 is the number of times the man was arrested during 1986: it is zero for most
men in the sample (72.29%), and it varies from 0 to 12. (The percentage of men arrested once during
1986 was 20.51.) The variable pcnv is the proportion (not percentage) of arrests prior to 1986 that led
to conviction, avgsen is average sentence length served for prior convictions (zero for most people),
ptime86 is months spent in prison in 1986, and qemp86 is the number of quarters during which the
man was employed in 1986 (from zero to four).

A linear model explaining arrests is

$$narr86 = \beta_0 + \beta_1 pcnv + \beta_2 avgsen + \beta_3 ptime86 + \beta_4 qemp86 + u,$$


where pcnv is a proxy for the likelihood for being convicted of a crime and avgsen is a measure of 
expected severity of punishment, if convicted. The variable ptime86 captures the incarcerative effects 
of crime: if an individual is in prison, he cannot be arrested for a crime outside of prison. Labor mar- 
ket opportunities are crudely captured by qemp86. 

First, we estimate the model without the variable avgsen. We obtain

$$\widehat{narr86} = .712 - .150\,pcnv - .034\,ptime86 - .104\,qemp86$$
n = 2,725, $R^2$ = .0413.


This equation says that, as a group, the three variables pcnv, ptime86, and qemp86 explain about 4.1%
of the variation in narr86.

Each of the OLS slope coefficients has the anticipated sign. An increase in the proportion of
convictions lowers the predicted number of arrests. If we increase pcnv by .50 (a large increase in the
probability of conviction), then, holding the other factors fixed, $\Delta\widehat{narr86} = -.150(.50) = -.075$.
This may seem unusual because an arrest cannot change by a fraction. But we can use this value to
obtain the predicted change in expected arrests for a large group of men. For example, among 100 men,
the predicted fall in arrests when pcnv increases by .50 is 7.5.

Similarly, a longer prison term leads to a lower predicted number of arrests. In fact, if ptime86
increases from 0 to 12, predicted arrests for a particular man fall by .034(12) = .408. Another quarter
in which legal employment is reported lowers predicted arrests by .104, which would be 10.4 arrests
among 100 men.
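
These ceteris paribus calculations follow one pattern: multiply a coefficient by the change in its variable and, if desired, scale by the size of the group. A brief sketch, using the coefficients from the regression without avgsen and a hypothetical helper `predicted_change`:

```python
# Slope estimates from the arrest equation estimated without avgsen.
COEF = {"pcnv": -0.150, "ptime86": -0.034, "qemp86": -0.104}


def predicted_change(variable, delta, group_size=1):
    """Predicted change in arrests when `variable` changes by `delta`,
    holding the other regressors fixed, summed over `group_size` men."""
    return COEF[variable] * delta * group_size


per_man_pcnv = predicted_change("pcnv", 0.50)             # one man, pcnv up by .50
per_100_pcnv = predicted_change("pcnv", 0.50, 100)        # same change, 100 men
per_man_prison = predicted_change("ptime86", 12)          # 12 more months in prison
per_100_quarter = predicted_change("qemp86", 1, 100)      # one more quarter employed

print(per_man_pcnv, per_100_pcnv, per_man_prison, per_100_quarter)
```

Negative values are predicted falls in arrests, so a value of about -7.5 for 100 men corresponds to the fall of 7.5 arrests described in the text.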

If avgsen is added to the model, we know that $R^2$ will increase. The estimated equation is

$$\widehat{narr86} = .707 - .151\,pcnv + .0074\,avgsen - .037\,ptime86 - .103\,qemp86$$
n = 2,725, $R^2$ = .0422.

Thus, adding the average sentence variable increases $R^2$ from .0413 to .0422, a practically small
effect. The sign of the coefficient on avgsen is also unexpected: it says that a longer average sentence
length increases criminal activity.


Example 3.5 deserves a final word of caution. The fact that the four explanatory variables included
in the second regression explain only about 4.2% of the variation in narr86 does not necessarily
mean that the equation is useless. Even though these variables collectively do not explain much of
the variation in arrests, it is still possible that the OLS estimates are reliable estimates of the ceteris
paribus effects of each independent variable on narr86. As we will see, whether this is the case does
not directly depend on the size of $R^2$. Generally, a low $R^2$ indicates that it is hard to predict individual
outcomes on $y$ with much accuracy, something we study in more detail in Chapter 6. In the arrest
example, the small $R^2$ reflects what we already suspect in the social sciences: it is generally very
difficult to predict individual behavior.


3-2i Regression through the Origin 


Sometimes, an economic theory or common sense suggests that $\beta_0$ should be zero, and so we should
briefly mention OLS estimation when the intercept is zero. Specifically, we now seek an equation of
the form

$$\tilde{y} = \tilde{\beta}_1 x_1 + \tilde{\beta}_2 x_2 + \cdots + \tilde{\beta}_k x_k,$$ [3.30]

where the symbol “~” over the estimates is used to distinguish them from the OLS estimates obtained
along with the intercept [as in (3.11)]. In (3.30), when $x_1 = 0, x_2 = 0, \ldots, x_k = 0$, the predicted
value is zero. In this case, $\tilde{\beta}_1, \ldots, \tilde{\beta}_k$ are said to be the OLS estimates from the regression of $y$ on
$x_1, x_2, \ldots, x_k$ through the origin.

The OLS estimates in (3.30), as always, minimize the sum of squared residuals, but with the
intercept set at zero. You should be warned that the properties of OLS that we derived earlier no
longer hold for regression through the origin. In particular, the OLS residuals no longer have a zero
sample average. Further, if $R^2$ is defined as $1 - SSR/SST$, where SST is given in (3.24) and SSR is
now $\sum_{i=1}^{n} (y_i - \tilde{\beta}_1 x_{i1} - \cdots - \tilde{\beta}_k x_{ik})^2$, then $R^2$ can actually be negative. This means that the sample
average, $\bar{y}$, “explains” more of the variation in the $y_i$ than the explanatory variables. Either we should
include an intercept in the regression or conclude that the explanatory variables poorly explain $y$. To
always have a nonnegative R-squared, some economists prefer to calculate $R^2$ as the squared
correlation coefficient between the actual and fitted values of $y$, as in (3.29). (In this case, the average fitted
value must be computed directly because it no longer equals $\bar{y}$.) However, there is no set rule on
computing R-squared for regression through the origin.

One serious drawback with regression through the origin is that, if the intercept $\beta_0$ in the
population model is different from zero, then the OLS estimators of the slope parameters will be biased. The
bias can be severe in some cases. The cost of estimating an intercept when $\beta_0$ is truly zero is that the
variances of the OLS slope estimators are larger.
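
The warnings about regression through the origin are easy to see in a simulated sample where the population intercept is far from zero. The following sketch is built on assumed numbers (an intercept of 10 and a slope of 0.1); it is an illustration, not an analysis of any data set in the text.

```python
import numpy as np

# Simulated sample whose population intercept is far from zero.
rng = np.random.default_rng(5)
n = 100
x = rng.normal(size=n)
y = 10.0 + 0.1 * x + rng.normal(scale=0.5, size=n)
sst = np.sum((y - y.mean()) ** 2)

# Regression through the origin: no column of ones in the design matrix.
beta_tilde, *_ = np.linalg.lstsq(x.reshape(-1, 1), y, rcond=None)
fitted_origin = x * beta_tilde[0]
resid_origin = y - fitted_origin
r2_origin = 1.0 - (resid_origin @ resid_origin) / sst        # can be negative
r2_corr = np.corrcoef(y, fitted_origin)[0, 1] ** 2           # never negative

# Usual OLS with an intercept, for comparison.
Z = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(Z, y, rcond=None)
resid_ols = y - Z @ beta_hat
r2_ols = 1.0 - (resid_ols @ resid_ols) / sst

print(f"mean residual through origin: {resid_origin.mean():.3f}")
print(f"R2 through origin (1 - SSR/SST): {r2_origin:.3f}, squared correlation: {r2_corr:.3f}")
print(f"mean residual with intercept: {resid_ols.mean():.2e}, R2: {r2_ols:.3f}")
```

With these assumed parameters, the residuals from the regression through the origin average roughly 10 rather than zero, and `1 - SSR/SST` is strongly negative, while the regression with an intercept behaves as described in the earlier subsections.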


3-3 The Expected Value of the OLS Estimators 


We now turn to the statistical properties of OLS for estimating the parameters in an underlying 
population model. In this section, we derive the expected value of the OLS estimators. In particu- 
lar, we state and discuss four assumptions, which are direct extensions of the simple regression 
model assumptions, under which the OLS estimators are unbiased for the population parameters. 
We also explicitly obtain the bias in OLS when an important variable has been omitted from the 
regression. 

You should remember that statistical properties have nothing to do with a particular sample, but 
rather with the property of estimators when random sampling is done repeatedly. Thus, Sections 3-3, 3-4, 
and 3-5 are somewhat abstract. Although we give examples of deriving bias for particular models, it 
is not meaningful to talk about the statistical properties of a set of estimates obtained from a single 
sample. 

The first assumption we make simply defines the multiple linear regression (MLR) model. 
Assumption MLR.1 Linear in Parameters


The model in the population can be written as

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + u,$$ [3.31]

where $\beta_0, \beta_1, \ldots, \beta_k$ are the unknown parameters (constants) of interest and $u$ is an unobserved
random error or disturbance term.


Equation (3.31) formally states the population model, sometimes called the true model, to allow 
for the possibility that we might estimate a model that differs from (3.31). The key feature is that the 
model is linear in the parameters $\beta_0, \beta_1, \ldots, \beta_k$. As we know, (3.31) is quite flexible because $y$ and
the independent variables can be arbitrary functions of the underlying variables of interest, such as 
natural logarithms and squares [see, for example, equation (3.7)]. 


Assumption MLR.2 Random Sampling 


We have a random sample of $n$ observations, $\{(x_{i1}, x_{i2}, \ldots, x_{ik}, y_i): i = 1, 2, \ldots, n\}$, following the
population model in Assumption MLR.1.


Sometimes, we need to write the equation for a particular observation i: for a randomly drawn 
observation from the population, we have 


$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + u_i.$$ [3.32]


Remember that i refers to the observation, and the second subscript on x is the variable number. For 
example, we can write a CEO salary equation for a particular CEO i as 


$$\log(salary_i) = \beta_0 + \beta_1 \log(sales_i) + \beta_2 ceoten_i + \beta_3 ceoten_i^2 + u_i.$$ [3.33]


The term u; contains the unobserved factors for CEO i that affect his or her salary. For applications, 
it is usually easiest to write the model in population form, as in (3.31). It contains less clutter and 
emphasizes the fact that we are interested in estimating a population relationship. 

In light of model (3.31), the OLS estimators $\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2, \ldots, \hat{\beta}_k$ from the regression of $y$ on
$x_1, \ldots, x_k$ are now considered to be estimators of $\beta_0, \beta_1, \ldots, \beta_k$. In Section 3-2, we saw that OLS
chooses the intercept and slope estimates for a particular sample so that the residuals average to zero 
and the sample correlation between each independent variable and the residuals is zero. Still, we did 
not include conditions under which the OLS estimates are well defined for a given sample. The next 
assumption fills that gap. 


Assumption MLR.3 No Perfect Collinearity 


In the sample (and therefore in the population), none of the independent variables is constant, and there 
are no exact linear relationships among the independent variables. 


Assumption MLR.3 is more complicated than its counterpart for simple regression because we must 
now look at relationships between all independent variables. If an independent variable in (3.31) is 
an exact linear combination of the other independent variables, then we say the model suffers from 
perfect collinearity, and it cannot be estimated by OLS. 

It is important to note that Assumption MLR.3 does allow the independent variables to be
correlated; they just cannot be perfectly correlated. If we did not allow for any correlation among the
independent variables, then multiple regression would be of very limited use for econometric analysis.
For example, in the model relating test scores to educational expenditures and average family income,

$$avgscore = \beta_0 + \beta_1 expend + \beta_2 avginc + u,$$


we fully expect expend and avginc to be correlated: school districts with high average family incomes 
tend to spend more per student on education. In fact, the primary motivation for including avginc 
in the equation is that we suspect it is correlated with expend, and so we would like to hold it fixed 
in the analysis. Assumption MLR.3 only rules out perfect correlation between expend and avginc in 
our sample. We would be very unlucky to obtain a sample where per-student expenditures are per- 
fectly correlated with average family income. But some correlation, perhaps a substantial amount, is 
expected and certainly allowed. 

The simplest way that two independent variables can be perfectly correlated is when one vari- 
able is a constant multiple of another. This can happen when a researcher inadvertently puts the same 
variable measured in different units into a regression equation. For example, in estimating a relation- 
ship between consumption and income, it makes no sense to include as independent variables income 
measured in dollars as well as income measured in thousands of dollars. One of these is redundant. 
What sense would it make to hold income measured in dollars fixed while changing income measured 
in thousands of dollars? 

We already know that different nonlinear functions of the same variable can appear among
the regressors. For example, the model $cons = \beta_0 + \beta_1 inc + \beta_2 inc^2 + u$ does not violate
Assumption MLR.3: even though $x_2 = inc^2$ is an exact function of $x_1 = inc$, $inc^2$ is not an exact linear
function of $inc$. Including $inc^2$ in the model is a useful way to generalize functional form, unlike
including income measured in dollars and in thousands of dollars.

Common sense tells us not to include the same explanatory variable measured in different units
in the same regression equation. There are also more subtle ways that one independent variable can
be a multiple of another. Suppose we would like to estimate an extension of a constant elasticity
consumption function. It might seem natural to specify a model such as

$$\log(cons) = \beta_0 + \beta_1 \log(inc) + \beta_2 \log(inc^2) + u,$$ [3.34]

where $x_1 = \log(inc)$ and $x_2 = \log(inc^2)$. Using the basic properties of the natural log (see Math
Refresher A), $\log(inc^2) = 2 \cdot \log(inc)$. That is, $x_2 = 2x_1$, and naturally this holds for all observations
in the sample. This violates Assumption MLR.3. What we should do instead is include $[\log(inc)]^2$,
not $\log(inc^2)$, along with $\log(inc)$. This is a sensible extension of the constant elasticity model, and
we will see how to interpret such models in Chapter 6.

Another way that independent variables can be perfectly collinear is when one independent vari- 
able can be expressed as an exact linear function of two or more of the other independent variables. 
For example, suppose we want to estimate the effect of campaign spending on campaign outcomes. 
For simplicity, assume that each election has two candidates. Let voteA be the percentage of the 
vote for Candidate A, let expendA be campaign expenditures by Candidate A, let expendB be cam- 
paign expenditures by Candidate B, and let totexpend be total campaign expenditures; the latter three 
variables are all measured in dollars. It may seem natural to specify the model as 


$$voteA = \beta_0 + \beta_1 expendA + \beta_2 expendB + \beta_3 totexpend + u,$$ [3.35]

in order to isolate the effects of spending by each candidate and the total amount of spending. But this
model violates Assumption MLR.3 because $x_3 = x_1 + x_2$ by definition. Trying to interpret this
equation in a ceteris paribus fashion reveals the problem. The parameter $\beta_1$ in equation (3.35) is
supposed to measure the effect of increasing expenditures by Candidate A by one dollar on Candidate A’s
vote, holding Candidate B’s spending and total spending fixed. This is nonsense, because if expendB
and totexpend are held fixed, then we cannot increase expendA.

The solution to the perfect collinearity in (3.35) is simple: drop any one of the three variables
from the model. We would probably drop totexpend, and then the coefficient on expendA would
measure the effect of increasing expenditures by A on the percentage of the vote received by A,
holding the spending by B fixed.
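
One practical way to detect perfect collinearity in a sample is to check whether the matrix of regressors (including the constant) has full column rank. The sketch below does this for simulated versions of the two cases discussed above; the spending and income figures are made up for illustration, and `numpy.linalg.matrix_rank` is used as a numerical stand-in for the algebraic condition in Assumption MLR.3.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 50
ones = np.ones(n)

# Campaign spending case: totexpend is an exact sum of the other two regressors.
expendA = rng.uniform(100.0, 1000.0, size=n)   # hypothetical spending, in dollars
expendB = rng.uniform(100.0, 1000.0, size=n)
totexpend = expendA + expendB

X_all_three = np.column_stack([ones, expendA, expendB, totexpend])
X_drop_total = np.column_stack([ones, expendA, expendB])
rank_all_three = np.linalg.matrix_rank(X_all_three)    # fewer than 4 columns' worth
rank_drop_total = np.linalg.matrix_rank(X_drop_total)  # full column rank

# Log income case: log(inc^2) is exactly 2*log(inc); [log(inc)]^2 is not linear in log(inc).
inc = rng.uniform(10.0, 100.0, size=n)                  # hypothetical incomes
X_log_of_square = np.column_stack([ones, np.log(inc), np.log(inc ** 2)])
X_square_of_log = np.column_stack([ones, np.log(inc), np.log(inc) ** 2])
rank_log_of_square = np.linalg.matrix_rank(X_log_of_square)
rank_square_of_log = np.linalg.matrix_rank(X_square_of_log)

print(rank_all_three, rank_drop_total, rank_log_of_square, rank_square_of_log)
```

A rank smaller than the number of columns signals that at least one regressor is an exact linear combination of the others, so the OLS estimates are not uniquely defined for that specification.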

The prior examples show that Assumption MLR.3 can fail if we are not careful in specifying 
our model. Assumption MLR.3 also fails if the sample size, n, is too small in relation to the number 
of parameters being estimated. In the general regression model in equation (3.31), there are k + 1 
parameters, and MLR.3 fails if n < k + 1. Intuitively, this makes sense: to estimate k + 1 parameters, 
we need at least k + 1 observations. Not surprisingly, it is better to have as many observations as pos- 
sible, something we will see with our variance calculations in Section 3-4. 

If the model is carefully specified and $n \geq k + 1$, Assumption MLR.3 can fail in rare cases due to
bad luck in collecting the sample. For example, in a wage equation with education and experience as
variables, it is possible that we could obtain a random sample where each individual has exactly twice
as much education as years of experience. This scenario would cause Assumption MLR.3 to fail, but it
can be considered very unlikely unless we have an extremely small sample size.

GOING FURTHER 3.3

In the previous example, if we use as explanatory variables expendA, expendB, and shareA,
where shareA = 100·(expendA/totexpend) is the percentage share of total campaign
expenditures made by Candidate A, does this violate Assumption MLR.3?

The final, and most important, assumption needed for unbiasedness is a direct extension of 
Assumption SLR.4. 


Assumption MLR.4 Zero Conditional Mean 


The error u has an expected value of zero given any values of the independent variables. In other words, 


$$E(u \mid x_1, x_2, \ldots, x_k) = 0.$$ [3.36]


One way that Assumption MLR.4 can fail is if the functional relationship between the explained and
explanatory variables is misspecified in equation (3.31): for example, if we forget to include the
quadratic term $inc^2$ in the consumption function $cons = \beta_0 + \beta_1 inc + \beta_2 inc^2 + u$ when we
estimate the model. Another functional form misspecification occurs when we use the level of a variable
when the log of the variable is what actually shows up in the population model, or vice versa. For
example, if the true model has log(wage) as the dependent variable but we use wage as the dependent
variable in our regression analysis, then the estimators will be biased. Intuitively, this should be
pretty clear. We will discuss ways of detecting functional form misspecification in Chapter 9.

Omitting an important factor that is correlated with any of $x_1, x_2, \ldots, x_k$ causes Assumption
MLR.4 to fail also. With multiple regression analysis, we are able to include many factors among 
the explanatory variables, and omitted variables are less likely to be a problem in multiple regression 
analysis than in simple regression analysis. Nevertheless, in any application, there are always factors 
that, due to data limitations or ignorance, we will not be able to include. If we think these factors 
should be controlled for and they are correlated with one or more of the independent variables, then 
Assumption MLR.4 will be violated. We will derive this bias later. 

There are other ways that u can be correlated with an explanatory variable. In Chapters 9 
and 15, we will discuss the problem of measurement error in an explanatory variable. In Chapter 16, 
we cover the conceptually more difficult problem in which one or more of the explanatory variables 
is determined jointly with y—as occurs when we view quantities and prices as being determined by 
the intersection of supply and demand curves. We must postpone our study of these problems until we 
have a firm grasp of multiple regression analysis under an ideal set of assumptions. 

When Assumption MLR.4 holds, we often say that we have exogenous explanatory variables. 
If $x_j$ is correlated with $u$ for any reason, then $x_j$ is said to be an endogenous explanatory variable.
The terms “exogenous” and “endogenous” originated in simultaneous equations analysis (see
Chapter 16), but the term “endogenous explanatory variable” has evolved to cover any case in which 
an explanatory variable may be correlated with the error term. 

Before we show the unbiasedness of the OLS estimators under MLR.1 to MLR.4, a word of cau- 
tion. Beginning students of econometrics sometimes confuse Assumptions MLR.3 and MLR.4, but 
they are quite different. Assumption MLR.3 rules out certain relationships among the independent or 
explanatory variables and has nothing to do with the error, u. You will know immediately when car- 
rying out OLS estimation whether or not Assumption MLR.3 holds. On the other hand, Assumption 
MLR.4—the much more important of the two—restricts the relationship between the unobserved 
factors in u and the explanatory variables. Unfortunately, we will never know for sure whether the 
average value of the unobserved factors is unrelated to the explanatory variables. But this is the criti- 
cal assumption. 

We are now ready to show unbiasedness of OLS under the first four multiple regression assump- 
tions. As in the simple regression case, the expectations are conditional on the values of the explana- 
tory variables in the sample, something we show explicitly in Appendix 3A but not in the text. 


THEOREM 3.1 UNBIASEDNESS OF OLS

Under Assumptions MLR.1 through MLR.4,

$$E(\hat{\beta}_j) = \beta_j, \quad j = 0, 1, \ldots, k,$$ [3.37]

for any values of the population parameter $\beta_j$. In other words, the OLS estimators are unbiased
estimators of the population parameters.


In our previous empirical examples, Assumption MLR.3 has been satisfied (because we have 
been able to compute the OLS estimates). Furthermore, for the most part, the samples are randomly 
chosen from a well-defined population. If we believe that the specified models are correct under the 
key Assumption MLR.4, then we can conclude that OLS is unbiased in these examples. 

Because we are approaching the point where we can use multiple regression in serious empirical 
work, it is useful to remember the meaning of unbiasedness. It is tempting, in examples such as the 
wage equation in (3.19), to say something like “9.2% is an unbiased estimate of the return to educa- 
tion.” As we know, an estimate cannot be unbiased: an estimate is a fixed number, obtained from a 
particular sample, which usually is not equal to the population parameter. When we say that OLS is 
unbiased under Assumptions MLR.1 through MLR.4, we mean that the procedure by which the OLS 
estimates are obtained is unbiased when we view the procedure as being applied across all possible 
random samples. We hope that we have obtained a sample that gives us an estimate close to the popu- 
lation value, but, unfortunately, this cannot be assured. What is assured is that we have no reason to 
believe our estimate is more likely to be too big or more likely to be too small. 
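The idea of applying the OLS procedure "across all possible random samples" can be mimicked on a computer. The sketch below is only an illustration under assumptions we choose ourselves: the population parameters in `beta`, the distributions of the regressors, and the helper `ols` (which solves the normal equations directly) are all hypothetical. The regressors are drawn once and held fixed, so the averages approximate expectations conditional on the sample values of the explanatory variables.

```python
import random

def ols(y, *xs):
    """OLS coefficients [b0, b1, ..., bk] obtained by solving the normal equations."""
    cols = [[1.0] * len(y)] + [list(x) for x in xs]
    k = len(cols)
    # Augmented matrix [X'X | X'y].
    a = [[sum(p * q for p, q in zip(cols[i], cols[j])) for j in range(k)]
         + [sum(p * q for p, q in zip(cols[i], y))] for i in range(k)]
    for c in range(k):  # Gauss-Jordan elimination with partial pivoting
        p = max(range(c, k), key=lambda r: abs(a[r][c]))
        a[c], a[p] = a[p], a[c]
        a[c] = [v / a[c][c] for v in a[c]]
        for r in range(k):
            if r != c:
                f = a[r][c]
                a[r] = [vr - f * vc for vr, vc in zip(a[r], a[c])]
    return [row[k] for row in a]

random.seed(0)
n, reps = 50, 2000
beta = [1.0, 0.5, -2.0]  # hypothetical population parameters (beta0, beta1, beta2)

# Regressors are drawn once; x2 is correlated with x1 but not perfectly (MLR.3 holds).
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [0.6 * a + random.gauss(0, 1) for a in x1]

sums = [0.0, 0.0, 0.0]
for _ in range(reps):
    # The error is drawn independently of x1 and x2, so MLR.4 holds by construction.
    y = [beta[0] + beta[1] * a + beta[2] * b + random.gauss(0, 1)
         for a, b in zip(x1, x2)]
    sums = [s + b for s, b in zip(sums, ols(y, x1, x2))]

means = [s / reps for s in sums]
print("true parameters:     ", beta)
print("average OLS estimate:", [round(m, 3) for m in means])
```

Any single replication can produce an estimate well away from the true value; it is only the average over replications that settles near the population parameters.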


3-3a Including Irrelevant Variables in a Regression Model 


One issue that we can dispense with fairly quickly is that of inclusion of an irrelevant variable or 
overspecifying the model in multiple regression analysis. This means that one (or more) of the inde- 
pendent variables is included in the model even though it has no partial effect on y in the population. 
(That is, its population coefficient is zero.) 

To illustrate the issue, suppose we specify the model as 


$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + u,$$ [3.38]


and this model satisfies Assumptions MLR.1 through MLR.4. However, $x_3$ has no effect on $y$ after
$x_1$ and $x_2$ have been controlled for, which means that $\beta_3 = 0$. The variable $x_3$ may or may not be
correlated with $x_1$ or $x_2$; all that matters is that, once $x_1$ and $x_2$ are controlled for, $x_3$ has no
effect on $y$. In terms of conditional expectations,
$E(y \mid x_1, x_2, x_3) = E(y \mid x_1, x_2) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$.

Because we do not know that $\beta_3 = 0$, we are inclined to estimate the equation including $x_3$:


$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \hat{\beta}_3 x_3.$$ [3.39]


We have included the irrelevant variable, $x_3$, in our regression. What is the effect of including $x_3$ in
(3.39) when its coefficient in the population model (3.38) is zero? In terms of the unbiasedness of
$\hat{\beta}_1$ and $\hat{\beta}_2$, there is no effect. This conclusion requires no special derivation, as it follows
immediately from Theorem 3.1. Remember, unbiasedness means $E(\hat{\beta}_j) = \beta_j$ for any value of $\beta_j$,
including $\beta_j = 0$. Thus, we can conclude that $E(\hat{\beta}_0) = \beta_0$, $E(\hat{\beta}_1) = \beta_1$,
$E(\hat{\beta}_2) = \beta_2$, and $E(\hat{\beta}_3) = 0$ (for any values of $\beta_0$, $\beta_1$, and $\beta_2$). Even though
$\hat{\beta}_3$ itself will never be exactly zero, its average value across all random samples will be zero.

The conclusion of the preceding example is much more general: including one or more irrelevant 
variables in a multiple regression model, or overspecifying the model, does not affect the unbiased- 
ness of the OLS estimators. Does this mean it is harmless to include irrelevant variables? No. As we 
will see in Section 3-4, including irrelevant variables can have undesirable effects on the variances of 
the OLS estimators. 
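A small simulation can make both halves of this conclusion concrete. The sketch below is an illustration under assumptions of our own: the parameter values, the design in which the irrelevant variable `x3` is highly correlated with `x1`, and the helper `ols` are all hypothetical. It compares the estimator of $\beta_1$ from the regression that excludes `x3` with the one from the regression that includes it.

```python
import random
import statistics

def ols(y, *xs):
    """OLS coefficients [b0, b1, ..., bk] obtained by solving the normal equations."""
    cols = [[1.0] * len(y)] + [list(x) for x in xs]
    k = len(cols)
    a = [[sum(p * q for p, q in zip(cols[i], cols[j])) for j in range(k)]
         + [sum(p * q for p, q in zip(cols[i], y))] for i in range(k)]
    for c in range(k):  # Gauss-Jordan elimination with partial pivoting
        p = max(range(c, k), key=lambda r: abs(a[r][c]))
        a[c], a[p] = a[p], a[c]
        a[c] = [v / a[c][c] for v in a[c]]
        for r in range(k):
            if r != c:
                f = a[r][c]
                a[r] = [vr - f * vc for vr, vc in zip(a[r], a[c])]
    return [row[k] for row in a]

random.seed(0)
n, reps = 60, 2000
beta0, beta1, beta2 = 1.0, 0.5, -1.0  # hypothetical values; beta3 = 0 in the population

x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [random.gauss(0, 1) for _ in range(n)]
# x3 does not affect y, but it is highly correlated with x1.
x3 = [a + 0.5 * random.gauss(0, 1) for a in x1]

b1_without, b1_with, b3_with = [], [], []
for _ in range(reps):
    y = [beta0 + beta1 * a + beta2 * b + random.gauss(0, 1) for a, b in zip(x1, x2)]
    b1_without.append(ols(y, x1, x2)[1])
    coefs = ols(y, x1, x2, x3)
    b1_with.append(coefs[1])
    b3_with.append(coefs[3])

print("mean of beta1-hat without x3:", round(statistics.mean(b1_without), 3))
print("mean of beta1-hat with x3:   ", round(statistics.mean(b1_with), 3))
print("mean of beta3-hat:           ", round(statistics.mean(b3_with), 3))
print("variance of beta1-hat without / with x3:",
      round(statistics.variance(b1_without), 4), "/",
      round(statistics.variance(b1_with), 4))
```

Both versions of the estimator of $\beta_1$ are centered at the true value, but in this design the one that carries the irrelevant regressor is noticeably more variable.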


3-3b Omitted Variable Bias: The Simple Case 


Now suppose that, rather than including an irrelevant variable, we omit a variable that actually 
belongs in the true (or population) model. This is often called the problem of excluding a relevant 
variable or underspecifying the model. We claimed in Chapter 2 and earlier in this chapter that this 
problem generally causes the OLS estimators to be biased. It is time to show this explicitly and, just 
as importantly, to derive the direction and size of the bias. 

Deriving the bias caused by omitting an important variable is an example of misspecification 
analysis. We begin with the case where the true population model has two explanatory variables and 
an error term: 


$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u,$$ [3.40]


and we assume that this model satisfies Assumptions MLR.1 through MLR.4.

Suppose that our primary interest is in $\beta_1$, the partial effect of $x_1$ on $y$. For example, $y$ is hourly
wage (or log of hourly wage), $x_1$ is education, and $x_2$ is a measure of innate ability. In order to get
an unbiased estimator of $\beta_1$, we should run a regression of $y$ on $x_1$ and $x_2$ (which gives unbiased
estimators of $\beta_0$, $\beta_1$, and $\beta_2$). However, due to our ignorance or data unavailability, we estimate
the model by excluding $x_2$. In other words, we perform a simple regression of $y$ on $x_1$ only, obtaining
the equation


$$\tilde{y} = \tilde{\beta}_0 + \tilde{\beta}_1 x_1.$$ [3.41]


We use the symbol “~” rather than “^” to emphasize that $\tilde{\beta}_1$ comes from an underspecified model.

When first learning about the omitted variable problem, it can be difficult to distinguish between
the underlying true model, (3.40) in this case, and the model that we actually estimate, which is
captured by the regression in (3.41). It may seem silly to omit the variable $x_2$ if it belongs in the
model, but often we have no choice. For example, suppose that wage is determined by


$$wage = \beta_0 + \beta_1 educ + \beta_2 abil + u.$$ [3.42]




Because ability is not observed, we instead estimate the model

$$wage = \beta_0 + \beta_1 educ + v,$$

where $v = \beta_2 abil + u$. The estimator of $\beta_1$ from the simple regression of wage on educ is what
we are calling $\tilde{\beta}_1$.

We derive the expected value of $\tilde{\beta}_1$ conditional on the sample values of $x_1$ and $x_2$. Deriving this
expectation is not difficult because $\tilde{\beta}_1$ is just the OLS slope estimator from a simple regression, and
we have already studied this estimator extensively in Chapter 2. The difference here is that we must
analyze its properties when the simple regression model is misspecified due to an omitted variable.

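As it turns out, we have done almost all of the work to derive the bias in the simple regression
estimator of $\beta_1$. From equation (3.23) we have the algebraic relationship
$\tilde{\beta}_1 = \hat{\beta}_1 + \hat{\beta}_2 \tilde{\delta}_1$, where $\hat{\beta}_1$ and $\hat{\beta}_2$ are the slope estimators (if we could
have them) from the multiple regression

$$y_i \text{ on } x_{i1}, x_{i2}, \quad i = 1, \ldots, n$$ [3.43]

and $\tilde{\delta}_1$ is the slope from the simple regression

$$x_{i2} \text{ on } x_{i1}, \quad i = 1, \ldots, n.$$ [3.44]

Because $\tilde{\delta}_1$ depends only on the independent variables in the sample, we treat it as fixed
(nonrandom) when computing $E(\tilde{\beta}_1)$. Further, because the model in (3.40) satisfies Assumptions
MLR.1 through MLR.4, we know that $\hat{\beta}_1$ and $\hat{\beta}_2$ would be unbiased for $\beta_1$ and $\beta_2$,
respectively. Therefore,

$$E(\tilde{\beta}_1) = E(\hat{\beta}_1 + \hat{\beta}_2 \tilde{\delta}_1) = E(\hat{\beta}_1) + E(\hat{\beta}_2)\tilde{\delta}_1 = \beta_1 + \beta_2 \tilde{\delta}_1,$$ [3.45]

which implies the bias in $\tilde{\beta}_1$ is

$$\mathrm{Bias}(\tilde{\beta}_1) = E(\tilde{\beta}_1) - \beta_1 = \beta_2 \tilde{\delta}_1.$$ [3.46]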


Because the bias in this case arises from omitting the explanatory variable $x_2$, the term on the
right-hand side of equation (3.46) is often called the omitted variable bias.
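The derivation rests on the algebraic relationship from equation (3.23), which holds exactly in any sample and not merely on average. A minimal sketch of a numerical check follows; the simulated data, the parameter values, and the helper `ols` are assumptions made for illustration only.

```python
import random

def ols(y, *xs):
    """OLS coefficients [b0, b1, ..., bk] obtained by solving the normal equations."""
    cols = [[1.0] * len(y)] + [list(x) for x in xs]
    k = len(cols)
    a = [[sum(p * q for p, q in zip(cols[i], cols[j])) for j in range(k)]
         + [sum(p * q for p, q in zip(cols[i], y))] for i in range(k)]
    for c in range(k):  # Gauss-Jordan elimination with partial pivoting
        p = max(range(c, k), key=lambda r: abs(a[r][c]))
        a[c], a[p] = a[p], a[c]
        a[c] = [v / a[c][c] for v in a[c]]
        for r in range(k):
            if r != c:
                f = a[r][c]
                a[r] = [vr - f * vc for vr, vc in zip(a[r], a[c])]
    return [row[k] for row in a]

random.seed(0)
n = 200
# Hypothetical data: x2 is positively correlated with x1.
x1 = [random.gauss(12, 2) for _ in range(n)]
x2 = [0.3 * a + random.gauss(0, 1) for a in x1]
y = [1.0 + 0.5 * a + 2.0 * b + random.gauss(0, 1) for a, b in zip(x1, x2)]

_, b1_tilde = ols(y, x1)            # simple regression of y on x1, as in (3.41)
_, b1_hat, b2_hat = ols(y, x1, x2)  # multiple regression, as in (3.43)
_, delta1_tilde = ols(x2, x1)       # simple regression of x2 on x1, as in (3.44)

gap = b1_tilde - (b1_hat + b2_hat * delta1_tilde)
print("beta1-tilde:                ", round(b1_tilde, 6))
print("beta1-hat + beta2-hat*delta:", round(b1_hat + b2_hat * delta1_tilde, 6))
print("difference is numerically zero:", abs(gap) < 1e-8)
```

Because the identity is purely algebraic, the check does not depend on the random seed or on the particular population values used to generate the data.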

From equation (3.46), we see that there are two cases where $\tilde{\beta}_1$ is unbiased. The first is pretty
obvious: if $\beta_2 = 0$ (so that $x_2$ does not appear in the true model (3.40)), then $\tilde{\beta}_1$ is unbiased.
We already know this from the simple regression analysis in Chapter 2. The second case is more
interesting. If $\tilde{\delta}_1 = 0$, then $\tilde{\beta}_1$ is unbiased for $\beta_1$, even if $\beta_2 \neq 0$.

Because $\tilde{\delta}_1$ is the sample covariance between $x_1$ and $x_2$ over the sample variance of $x_1$,
$\tilde{\delta}_1 = 0$ if, and only if, $x_1$ and $x_2$ are uncorrelated in the sample. Thus, we have the important
conclusion that, if $x_1$ and $x_2$ are uncorrelated in the sample, then $\tilde{\beta}_1$ is unbiased. This is not
surprising: in Section 3-2, we showed that the simple regression estimator $\tilde{\beta}_1$ and the multiple
regression estimator $\hat{\beta}_1$ are the same when $x_1$ and $x_2$ are uncorrelated in the sample. [We can also
show that $\tilde{\beta}_1$ is unbiased without conditioning on the $x_{i2}$ if $E(x_2 \mid x_1) = E(x_2)$; then, for
estimating $\beta_1$, leaving $x_2$ in the error term does not violate the zero conditional mean assumption
for the error, once we adjust the intercept.]

When $x_1$ and $x_2$ are correlated, $\tilde{\delta}_1$ has the same sign as the correlation between $x_1$ and $x_2$:
$\tilde{\delta}_1 > 0$ if $x_1$ and $x_2$ are positively correlated and $\tilde{\delta}_1 < 0$ if $x_1$ and $x_2$ are negatively
correlated. The sign of the bias in $\tilde{\beta}_1$ depends on the signs of both $\beta_2$ and $\tilde{\delta}_1$ and is
summarized in Table 3.2 for the four possible cases when there is bias. Table 3.2 warrants careful
study. For example, the bias in $\tilde{\beta}_1$ is positive if $\beta_2 > 0$ ($x_2$ has a positive effect on $y$) and
$x_1$ and $x_2$ are positively correlated, the bias is negative if $\beta_2 > 0$ and $x_1$ and $x_2$ are negatively
correlated, and so on.

Table 3.2 summarizes the direction of the bias, but the size of the bias is also very important. A
small bias of either sign need not be a cause for concern. For example, if the return to education in the
population is 8.6% and the bias in the OLS estimator is 0.1% (a tenth of one percentage point), then
we would not be very concerned. On the other hand, a bias on the order of three percentage points
would be much more serious. The size of the bias is determined by the sizes of $\beta_2$ and $\tilde{\delta}_1$.


TABLE 3.2 Summary of Bias in $\tilde{\beta}_1$ When $x_2$ Is Omitted in Estimating Equation (3.40)

                  Corr($x_1$, $x_2$) > 0    Corr($x_1$, $x_2$) < 0
$\beta_2 > 0$     Positive bias             Negative bias
$\beta_2 < 0$     Negative bias             Positive bias
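The logic of Table 3.2 amounts to taking the sign of the product $\beta_2 \tilde{\delta}_1$, where $\tilde{\delta}_1$ has the same sign as the sample correlation between $x_1$ and $x_2$. As a small sketch (the function name and its string labels are hypothetical choices made here), the table can be encoded as follows:

```python
def omitted_variable_bias_sign(beta2, corr_x1_x2):
    """Direction of the bias in the simple regression slope, following Table 3.2.

    beta2       -- (assumed) partial effect of the omitted variable x2 on y
    corr_x1_x2  -- (assumed) correlation between x1 and x2; only its sign matters
    """
    product = beta2 * corr_x1_x2
    if product > 0:
        return "positive bias"
    if product < 0:
        return "negative bias"
    return "no bias"  # beta2 = 0 or x1 and x2 uncorrelated

# The four cases of Table 3.2; only the signs of the arguments matter.
for beta2 in (1.0, -1.0):
    for corr in (0.5, -0.5):
        print(f"beta2 = {beta2:+.1f}, corr = {corr:+.1f}:",
              omitted_variable_bias_sign(beta2, corr))
```

The function says nothing about the size of the bias, which depends on the magnitudes of $\beta_2$ and $\tilde{\delta}_1$ and not just on their signs.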

In practice, because $\beta_2$ is an unknown population parameter, we cannot be certain whether $\beta_2$ is
positive or negative. Nevertheless, we usually have a pretty good idea about the direction of the partial
effect of $x_2$ on $y$. Further, even though the sign of the correlation between $x_1$ and $x_2$ cannot be
known if $x_2$ is not observed, in many cases, we can make an educated guess about whether $x_1$ and $x_2$
are positively or negatively correlated.

In the wage equation (3.42), by definition, more ability leads to higher productivity and therefore
higher wages: $\beta_2 > 0$. Also, there are reasons to believe that educ and abil are positively correlated:
on average, individuals with more innate ability choose higher levels of education. Thus, the OLS
estimates from the simple regression equation $wage = \beta_0 + \beta_1 educ + v$ are on average too large.
This does not mean that the estimate obtained from our sample is too big. We can only say that if
we collect many random samples and obtain the simple regression estimates each time, then the
average of these estimates will be greater than $\beta_1$.
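The statement about the average of the simple regression estimates can be illustrated by simulation. The following sketch rests entirely on assumptions of our own: the population values, the way `educ` is made to depend on `abil`, and the helper `slope` are hypothetical and are not estimates from any data set. Ability is "unobserved" only in the sense that the regression ignores it.

```python
import random

def slope(y, x):
    """OLS slope from the simple regression of y on x (with an intercept)."""
    xbar, ybar = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    return sxy / sxx

random.seed(0)
n, reps = 100, 2000
beta0, beta1, beta2 = 0.5, 0.08, 0.10  # hypothetical population values, beta2 > 0

# Individuals with more ability tend to obtain more education.
abil = [random.gauss(0, 1) for _ in range(n)]
educ = [12 + 2 * a + random.gauss(0, 2) for a in abil]

# delta1-tilde: slope from regressing the omitted variable on educ, as in (3.44).
delta1 = slope(abil, educ)

estimates = []
for _ in range(reps):
    logwage = [beta0 + beta1 * e + beta2 * a + random.gauss(0, 0.4)
               for e, a in zip(educ, abil)]
    estimates.append(slope(logwage, educ))  # abil is left in the error term

mean_tilde = sum(estimates) / reps
predicted = beta1 + beta2 * delta1  # right-hand side of (3.45)
print("true beta1:                ", beta1)
print("average simple-regression estimate:", round(mean_tilde, 4))
print("beta1 + beta2 * delta1:    ", round(predicted, 4))
```

The average estimate sits above $\beta_1$ and close to the value given by (3.45), even though individual replications may fall on either side of the true parameter.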


Hourly Wage Equation 


Suppose the model $\log(wage) = \beta_0 + \beta_1 educ + \beta_2 abil + u$ satisfies Assumptions MLR.1 through
MLR.4. The data set in WAGE does not contain data on ability, so we estimate $\beta_1$ from the simple
regression

$$\widehat{\log(wage)} = .584 + .083\, educ$$
$$n = 526, \quad R^2 = .186.$$ [3.47]

This is the result from only a single sample, so we cannot say that .083 is greater than $\beta_1$; the true
return to education could be lower or higher than 8.3% (and we will never know for sure). Nevertheless,
we know that the average of the estimates across all random samples would be too large.


As a second example, suppose that, at the elementary school level, the average score for students 
on a standardized exam is determined by 


$$avgscore = \beta_0 + \beta_1 expend + \beta_2 povrate + u,$$ [3.48]


where expend is expenditure per student and povrate is the poverty rate of the children in the school.
Using school district data, we only have observations on the percentage of students with a passing
grade and per-student expenditures; we do not have information on poverty rates. Thus, we estimate
$\beta_1$ from the simple regression of avgscore on expend.

We can again obtain the likely bias in $\tilde{\beta}_1$. First, $\beta_2$ is probably negative: there is ample evidence
that children living in poverty score lower, on average, on standardized tests. Second, the average
expenditure per student is probably negatively correlated with the poverty rate: the higher the poverty
rate, the lower the average per-student spending, so that Corr($x_1$, $x_2$) < 0. From Table 3.2, $\tilde{\beta}_1$ will
have a positive bias. This observation has important implications. It could be that the true effect of
spending is zero; that is, $\beta_1 = 0$. However, the simple regression estimate of $\beta_1$ will usually be
greater than zero, and this could lead us to conclude that expenditures are important when they are not.

When reading and performing empirical work in economics, it is important to master the
terminology associated with biased estimators. In the context of omitting a variable from model (3.40),
if $E(\tilde{\beta}_1) > \beta_1$, then we say that $\tilde{\beta}_1$ has an upward bias. When $E(\tilde{\beta}_1) < \beta_1$, $\tilde{\beta}_1$ has a
downward bias. These definitions are the same whether $\beta_1$ is positive or negative. The phrase biased
toward zero refers to cases where $E(\tilde{\beta}_1)$ is closer to zero than is $\beta_1$. Therefore, if $\beta_1$ is
positive, then $\tilde{\beta}_1$ is biased toward zero if it has a downward bias. On the other hand, if $\beta_1 < 0$,
then $\tilde{\beta}_1$ is biased toward zero if it has an upward bias.


3-3c Omitted Variable Bias: More General Cases


Deriving the sign of omitted variable bias when there are multiple regressors in the estimated model is 
more difficult. We must remember that correlation between a single explanatory variable and the error 
generally results in all OLS estimators being biased. For example, suppose the population model 


$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + u$$ [3.49]

satisfies Assumptions MLR.1 through MLR.4. But we omit $x_3$ and estimate the model as

$$\tilde{y} = \tilde{\beta}_0 + \tilde{\beta}_1 x_1 + \tilde{\beta}_2 x_2.$$ [3.50]


Now, suppose that $x_2$ and $x_3$ are uncorrelated, but that $x_1$ is correlated with $x_3$. In other words, $x_1$
is correlated with the omitted variable, but $x_2$ is not. It is tempting to think that, while $\tilde{\beta}_1$ is
probably biased based on the derivation in the previous subsection, $\tilde{\beta}_2$ is unbiased because $x_2$ is
uncorrelated with $x_3$. Unfortunately, this is not generally the case: both $\tilde{\beta}_1$ and $\tilde{\beta}_2$ will
normally be biased. The only exception to this is when $x_1$ and $x_2$ are also uncorrelated.
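The claim that both estimators are normally biased, even though $x_2$ is uncorrelated with the omitted $x_3$, is easy to doubt and easy to check by simulation. The sketch below uses a design we invent for the purpose: `x2` and `x3` are drawn independently, `x1` depends on both, and the parameter values and the helper `ols` are hypothetical.

```python
import random
import statistics

def ols(y, *xs):
    """OLS coefficients [b0, b1, ..., bk] obtained by solving the normal equations."""
    cols = [[1.0] * len(y)] + [list(x) for x in xs]
    k = len(cols)
    a = [[sum(p * q for p, q in zip(cols[i], cols[j])) for j in range(k)]
         + [sum(p * q for p, q in zip(cols[i], y))] for i in range(k)]
    for c in range(k):  # Gauss-Jordan elimination with partial pivoting
        p = max(range(c, k), key=lambda r: abs(a[r][c]))
        a[c], a[p] = a[p], a[c]
        a[c] = [v / a[c][c] for v in a[c]]
        for r in range(k):
            if r != c:
                f = a[r][c]
                a[r] = [vr - f * vc for vr, vc in zip(a[r], a[c])]
    return [row[k] for row in a]

random.seed(0)
n, reps = 200, 1000
beta = [1.0, 0.5, 0.5, 1.0]  # hypothetical (beta0, beta1, beta2, beta3)

x2 = [random.gauss(0, 1) for _ in range(n)]
x3 = [random.gauss(0, 1) for _ in range(n)]  # drawn independently of x2
# x1 is correlated with both x2 and the (to be omitted) x3.
x1 = [0.7 * b + 0.7 * c + 0.5 * random.gauss(0, 1) for b, c in zip(x2, x3)]

b1_tilde, b2_tilde = [], []
for _ in range(reps):
    y = [beta[0] + beta[1] * a + beta[2] * b + beta[3] * c + random.gauss(0, 1)
         for a, b, c in zip(x1, x2, x3)]
    _, t1, t2 = ols(y, x1, x2)  # x3 is omitted, as in (3.50)
    b1_tilde.append(t1)
    b2_tilde.append(t2)

bias1 = statistics.mean(b1_tilde) - beta[1]
bias2 = statistics.mean(b2_tilde) - beta[2]
print("average bias in beta1-tilde:", round(bias1, 3))
print("average bias in beta2-tilde:", round(bias2, 3))
```

In this design the estimator of $\beta_2$ is pulled away from its true value purely through the correlation between $x_1$ and $x_2$; making `x1` independent of `x2` in the code removes that second bias.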

Even in the fairly simple model above, it can be difficult to obtain the direction of bias in $\tilde{\beta}_1$
and $\tilde{\beta}_2$. This is because $x_1$, $x_2$, and $x_3$ can all be pairwise correlated. Nevertheless, an
approximation is often practically useful. If we assume that $x_1$ and $x_2$ are uncorrelated, then we can
study the bias in $\tilde{\beta}_1$ as if $x_2$ were absent from both the population and the estimated models. In
fact, when $x_1$ and $x_2$ are uncorrelated, it can be shown that

$$E(\tilde{\beta}_1) = \beta_1 + \beta_3 \frac{\sum_{i=1}^{n} (x_{i1} - \bar{x}_1) x_{i3}}{\sum_{i=1}^{n} (x_{i1} - \bar{x}_1)^2}.$$

This is just like equation (3.45), but $\beta_3$ replaces $\beta_2$ and $x_3$ replaces $x_2$ in regression (3.44).
Therefore, the bias in $\tilde{\beta}_1$ is obtained by replacing $\beta_2$ with $\beta_3$ and $x_2$ with $x_3$ in Table 3.2.
If $\beta_3 > 0$ and Corr($x_1$, $x_3$) > 0, the bias in $\tilde{\beta}_1$ is positive, and so on.

As an example, suppose we add exper to the wage model: 


$$wage = \beta_0 + \beta_1 educ + \beta_2 exper + \beta_3 abil + u.$$


If abil is omitted from the model, the estimators of both $\beta_1$ and $\beta_2$ are biased, even if we assume
exper is uncorrelated with abil. We are mostly interested in the return to education, so it would be nice
if we could conclude that $\tilde{\beta}_1$ has an upward or a downward bias due to omitted ability. This
conclusion is not possible without further assumptions. As an approximation, let us suppose that, in
addition to exper and abil being uncorrelated, educ and exper are also uncorrelated. (In reality, they are
somewhat negatively correlated.) Because $\beta_3 > 0$ and educ and abil are positively correlated, $\tilde{\beta}_1$
would have an upward bias, just as if exper were not in the model.

The reasoning used in the previous example is often followed as a rough guide for obtaining
the likely bias in estimators in more complicated models. Usually, the focus is on the relationship
between a particular explanatory variable, say, $x_1$, and the key omitted factor. Strictly speaking,
ignoring all other explanatory variables is a valid practice only when each one is uncorrelated with
$x_1$, but it is still a useful guide. Appendix 3A contains a more careful analysis of omitted variable
bias with multiple explanatory variables.


3-4 The Variance of the OLS Estimators 


We now obtain the variance of the OLS estimators so that, in addition to knowing the central
tendencies of the $\hat{\beta}_j$, we also have a measure of the spread in their sampling distributions. Before
finding the variances, we add a homoskedasticity assumption, as in Chapter 2. We do this for two
reasons. First, the formulas are simplified by imposing the constant error variance assumption. Second,
in Section 3-5, we will see that OLS has an important efficiency property if we add the
homoskedasticity assumption.

In the multiple regression framework, homoskedasticity is stated as follows: 


Assumption MLR.5 Homoskedasticity 


The error u has the same variance given any value of the explanatory variables. In other words, 


$$\mathrm{Var}(u \mid x_1, \ldots, x_k) = \sigma^2.$$




Assumption MLR.5 means that the variance in the error term, u, conditional on the explanatory vari- 
ables, is the same for all combinations of outcomes of the explanatory variables. If this assumption 
fails, then the model exhibits heteroskedasticity, just as in the two-variable case. 

In the equation 


$$wage = \beta_0 + \beta_1 educ + \beta_2 exper + \beta_3 tenure + u,$$


homoskedasticity requires that the variance of the unobserved error $u$ does not depend on the levels of
education, experience, or tenure. That is,


$$\mathrm{Var}(u \mid educ, exper, tenure) = \sigma^2.$$


If this variance changes with any of the three explanatory variables, then heteroskedasticity is 
present. 

Assumptions MLR.1 through MLR.5 are collectively known as the Gauss-Markov assumptions
(for cross-sectional regression). So far, our statements of the assumptions are suitable only when
applied to cross-sectional analysis with random sampling. As we will see, the Gauss-Markov
assumptions for time series analysis, and for other situations such as panel data analysis, are more
difficult to state, although there are many similarities.

In the discussion that follows, we will use the symbol $\mathbf{x}$ to denote the set of all independent
variables, $(x_1, \ldots, x_k)$. Thus, in the wage regression with educ, exper, and tenure as independent
variables, $\mathbf{x}$ = (educ, exper, tenure). Then we can write Assumptions MLR.1 and MLR.4 as


$$E(y \mid \mathbf{x}) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k,$$


and Assumption MLR.5 is the same as $\mathrm{Var}(y \mid \mathbf{x}) = \sigma^2$. Stating the assumptions in this way
clearly illustrates how Assumption MLR.5 differs greatly from Assumption MLR.4. Assumption MLR.4
says that the expected value of $y$, given $\mathbf{x}$, is linear in the parameters, but it certainly depends on
$x_1, x_2, \ldots, x_k$. Assumption MLR.5 says that the variance of $y$, given $\mathbf{x}$, does not depend on the
values of the independent variables.

We can now obtain the variances of the $\hat{\beta}_j$, where we again condition on the sample values of the
independent variables. The proof is in the appendix to this chapter.


THEOREM 3.2 SAMPLING VARIANCES OF THE OLS SLOPE ESTIMATORS


Under Assumptions MLR.1 through MLR.5, conditional on the sample values of the independent
variables,

$$\mathrm{Var}(\hat{\beta}_j) = \frac{\sigma^2}{SST_j (1 - R_j^2)},$$ [3.51]

for $j = 1, 2, \ldots, k$, where $SST_j = \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2$ is the total sample variation in $x_j$, and
$R_j^2$ is the R-squared from regressing $x_j$ on all other independent variables (and including an intercept).

The careful reader may be wondering whether there is a simple formula for the variance of $\hat{\beta}_j$
where we do not condition on the sample outcomes of the explanatory variables. The answer is: none
that is useful. The formula in (3.51) is a highly nonlinear function of the $x_{ij}$, making averaging out
across the population distribution of the explanatory variables virtually impossible. Fortunately, for
any practical purpose equation (3.51) is what we want. Even when we turn to approximate,
large-sample properties of OLS in Chapter 5, it turns out that (3.51) estimates the quantity we need for
large-sample analysis, provided Assumptions MLR.1 through MLR.5 hold.

Before we study equation (3.51) in more detail, it is important to know that all of the
Gauss-Markov assumptions are used in obtaining this formula. Whereas we did not need the
homoskedasticity assumption to conclude that OLS is unbiased, we do need it to justify equation (3.51).

The size of $\mathrm{Var}(\hat{\beta}_j)$ is practically important. A larger variance means a less precise estimator,
and this translates into larger confidence intervals and less accurate hypothesis tests (as we will see in
Chapter 4). In the next subsection, we discuss the elements comprising (3.51).
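Equation (3.51) can be checked against the standard matrix expression for the conditional variance of OLS, $\sigma^2 (X'X)^{-1}$, which is not stated in this section but gives the same number for each slope. The sketch below is an illustration only: the simulated regressors, the value of `sigma2`, and the helpers `solve` and `cross` are assumptions introduced here.

```python
import random

def solve(a, b):
    """Solve the linear system a z = b by Gauss-Jordan elimination."""
    k = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        m[c] = [v / m[c][c] for v in m[c]]
        for r in range(k):
            if r != c:
                f = m[r][c]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[c])]
    return [row[k] for row in m]

def cross(cols_a, cols_b):
    """Matrix of inner products between two lists of columns."""
    return [[sum(p * q for p, q in zip(ca, cb)) for cb in cols_b] for ca in cols_a]

random.seed(1)
n, sigma2 = 200, 4.0  # hypothetical sample size and error variance
x2 = [random.gauss(0, 1) for _ in range(n)]
x3 = [random.gauss(0, 1) for _ in range(n)]
x1 = [0.8 * a + 0.3 * b + random.gauss(0, 1) for a, b in zip(x2, x3)]
ones = [1.0] * n

# Matrix form: the x1 diagonal element of sigma^2 (X'X)^{-1}.
X = [ones, x1, x2, x3]
var_matrix = sigma2 * solve(cross(X, X), [0.0, 1.0, 0.0, 0.0])[1]

# Equation (3.51): regress x1 on the other regressors to obtain R_1^2.
others = [ones, x2, x3]
g = solve(cross(others, others), [row[0] for row in cross(others, [x1])])
fitted = [sum(gj * col[i] for gj, col in zip(g, others)) for i in range(n)]
x1bar = sum(x1) / n
sst1 = sum((v - x1bar) ** 2 for v in x1)
ssr1 = sum((v - f) ** 2 for v, f in zip(x1, fitted))
r2_1 = 1 - ssr1 / sst1
var_formula = sigma2 / (sst1 * (1 - r2_1))

print("sigma^2 (X'X)^{-1} element:", var_matrix)
print("equation (3.51):           ", var_formula)
```

The two numbers agree up to rounding error because $SST_1(1 - R_1^2)$ is just the sum of squared residuals from regressing $x_1$ on the other regressors.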


3-4a The Components of the OLS Variances: Multicollinearity 


Equation (3.51) shows that the variance of $\hat{\beta}_j$ depends on three factors: $\sigma^2$, $SST_j$, and $R_j^2$.
Remember that the index $j$ simply denotes any one of the independent variables (such as education or
poverty rate). We now consider each of the factors affecting $\mathrm{Var}(\hat{\beta}_j)$ in turn.


The Error Variance, $\sigma^2$. From equation (3.51), a larger $\sigma^2$ means larger sampling variances
for the OLS estimators. This is not at all surprising: more “noise” in the equation (a larger $\sigma^2$) makes
it more difficult to estimate the partial effect of any of the independent variables on $y$, and this is
reflected in higher variances for the OLS slope estimators. Because $\sigma^2$ is a feature of the population,
it has nothing to do with the sample size. It is the one component of (3.51) that is unknown. We will
see later how to obtain an unbiased estimator of $\sigma^2$.

For a given dependent variable y, there is really only one way to reduce the error variance, and 
that is to add more explanatory variables to the equation (take some factors out of the error term). 
Unfortunately, it is not always possible to find additional legitimate factors that affect y. 


The Total Sample Variation in $x_j$, $SST_j$. From equation (3.51), we see that the larger
the total variation in $x_j$ is, the smaller is $\mathrm{Var}(\hat{\beta}_j)$. Thus, everything else being equal, for
estimating $\beta_j$ we prefer to have as much sample variation in $x_j$ as possible. We already discovered
this in the simple regression case in Chapter 2. Although it is rarely possible for us to choose the
sample values of the independent variables, there is a way to increase the sample variation in each of
the independent variables: increase the sample size. In fact, when one randomly samples from a
population, $SST_j$ increases without bound as the sample size increases, roughly as a linear function
of $n$. This is the component of the variance that systematically depends on the sample size.
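The roughly linear growth of $SST_j$ with $n$ is simple to see numerically. In the sketch below, the population (normal with mean 10 and standard deviation 3) and the sample sizes are arbitrary choices made for illustration.

```python
import random

def sst(x):
    """Total sample variation: sum of squared deviations from the sample mean."""
    xbar = sum(x) / len(x)
    return sum((v - xbar) ** 2 for v in x)

random.seed(0)
results = {}
for n in (500, 5000, 50000):
    # Hypothetical population for x_j: mean 10, standard deviation 3.
    x = [random.gauss(10, 3) for _ in range(n)]
    results[n] = sst(x)
    print(f"n = {n:6d}   SST = {results[n]:12.1f}   SST / n = {results[n] / n:.3f}")
```

The ratio SST / n stays close to the population variance while SST itself grows about tenfold with each tenfold increase in the sample size, which is why larger samples shrink $\mathrm{Var}(\hat{\beta}_j)$.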

When $SST_j$ is small, $\mathrm{Var}(\hat{\beta}_j)$ can get very large, but a small $SST_j$ is not a violation of
Assumption MLR.3. Technically, as $SST_j$ goes to zero, $\mathrm{Var}(\hat{\beta}_j)$ approaches infinity. The extreme
case of no sample variation in $x_j$, $SST_j = 0$, is not allowed by Assumption MLR.3 because then we
cannot even compute the OLS estimates.


The Linear Relationships among the Independent Variables, $R_j^2$. The
term $R_j^2$ in equation (3.51) is the most difficult of the three components to understand. This term
does not appear in simple regression analysis because there is only one independent variable in such
cases. It is important to see that this R-squared is distinct from the R-squared in the regression of $y$ on
$x_1, x_2, \ldots, x_k$: $R_j^2$ is obtained from a regression involving only the independent variables in the
original model, where $x_j$ plays the role of a dependent variable.

Consider first the $k = 2$ case: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u$. Then,
$\mathrm{Var}(\hat{\beta}_1) = \sigma^2 / [SST_1 (1 - R_1^2)]$, where $R_1^2$ is the R-squared from the simple regression of
$x_1$ on $x_2$ (and an intercept, as always). Because the R-squared measures goodness-of-fit, a value of
$R_1^2$ close to one indicates that $x_2$ explains much of the variation in $x_1$ in the sample. This means
that $x_1$ and $x_2$ are highly correlated.

FIGURE 3.1 $\mathrm{Var}(\hat{\beta}_1)$ as a function of $R_1^2$

As $R_1^2$ increases to one, $\mathrm{Var}(\hat{\beta}_1)$ gets larger and larger. Thus, a high degree of linear
relationship between $x_1$ and $x_2$ can lead to large variances for the OLS slope estimators. (A similar
argument applies to $\hat{\beta}_2$.) See Figure 3.1 for the relationship between $\mathrm{Var}(\hat{\beta}_1)$ and the R-squared
from the regression of $x_1$ on $x_2$.

In the general case, $R_j^2$ is the proportion of the total variation in $x_j$ that can be explained by the
other independent variables appearing in the equation. For a given $\sigma^2$ and $SST_j$, the smallest
$\mathrm{Var}(\hat{\beta}_j)$ is obtained when $R_j^2 = 0$, which happens if, and only if, $x_j$ has zero sample correlation
with every other independent variable. This is the best case for estimating $\beta_j$, but it is rarely
encountered.

The other extreme case, $R_j^2 = 1$, is ruled out by Assumption MLR.3, because $R_j^2 = 1$ means that,
in the sample, $x_j$ is a perfect linear combination of some of the other independent variables in the
regression. A more relevant case is when $R_j^2$ is “close” to one. From equation (3.51) and Figure 3.1,
we see that this can cause $\mathrm{Var}(\hat{\beta}_j)$ to be large: $\mathrm{Var}(\hat{\beta}_j) \to \infty$ as $R_j^2 \to 1$. High (but not
perfect) correlation between two or more independent variables is called multicollinearity.
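To get a feel for how quickly the variance grows, it helps to isolate the factor $1/(1 - R_j^2)$ in equation (3.51), holding $\sigma^2$ and $SST_j$ fixed. This factor is commonly called the variance inflation factor, a name not used in the passage above. The function below is a small sketch with a hypothetical name and a handful of arbitrary $R_j^2$ values.

```python
def variance_inflation(r2_j):
    """Factor 1 / (1 - R_j^2) by which Var(beta_j-hat) in (3.51) exceeds sigma^2 / SST_j."""
    if not 0.0 <= r2_j < 1.0:
        # R_j^2 = 1 is perfect collinearity, which Assumption MLR.3 rules out.
        raise ValueError("R_j^2 must satisfy 0 <= R_j^2 < 1")
    return 1.0 / (1.0 - r2_j)

for r2 in (0.0, 0.5, 0.9, 0.99, 0.999):
    print(f"R_j^2 = {r2:5.3f}   inflation factor = {variance_inflation(r2):8.1f}")
```

The factor is modest for moderate values of $R_j^2$ and grows without bound only as $R_j^2$ approaches one, which matches the shape described for Figure 3.1.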

Before we discuss the multicollinearity issue further, it is important to be very clear on one thing:
a case where $R_j^2$ is close to one is not a violation of Assumption MLR.3.

Because multicollinearity violates none of our assumptions, the “problem” of multicollinearity is not
really well defined. When we say that multicollinearity arises for estimating $\beta_j$ when $R_j^2$ is “close”
to one, we put “close” in quotation marks because there is no absolute number that we can cite to
conclude that multicollinearity is a problem. For example, $R_j^2 = .9$ means that 90% of the sample
variation in $x_j$ can be explained by the other independent variables in the regression model.
Unquestionably, this means that $x_j$ has a strong linear relationship to the other independent variables.
But whether this translates into a $\mathrm{Var}(\hat{\beta}_j)$ that is too large to be useful depends on the sizes of
$\sigma^2$ and $SST_j$. As we will see in Chapter 4, for statistical inference, what ultimately matters is how
big $\hat{\beta}_j$ is in relation to its standard deviation.

Just as a large value of $R_j^2$ can cause a large $\mathrm{Var}(\hat{\beta}_j)$, so can a small value of $SST_j$. Therefore,
a small sample size can lead to large sampling variances, too. Worrying about high degrees of
correlation among the independent variables in the sample is really no different from worrying about a
small sample size: both work to increase $\mathrm{Var}(\hat{\beta}_j)$. The famous University of Wisconsin
econometrician Arthur Goldberger, reacting to econometricians’ obsession with multicollinearity, has
(tongue in cheek) coined the term micronumerosity, which he defines as the “problem of small sample
size.” [For an engaging discussion of multicollinearity and micronumerosity, see Goldberger (1991).]

Although the problem of multicollinearity cannot be clearly defined, one thing is clear: everything
else being equal, for estimating $\beta_j$, it is better to have less correlation between $x_j$ and the other
independent variables. This observation often leads to a discussion of how to “solve” the
multicollinearity problem. In the social sciences, where we are usually passive collectors of data, there
is no good way to reduce variances of unbiased estimators other than to collect more data. For a given
data set, we can try dropping other independent variables from the model in an effort to reduce
multicollinearity. Unfortunately, dropping a variable that belongs in the population model can lead to
bias, as we saw in Section 3-3.

Perhaps an example at this point will help clarify some of the issues raised concerning multicol- 
linearity. Suppose we are interested in estimating the effect of various school expenditure categories 
on student performance. It is likely that expenditures on teacher salaries, instructional materials, ath- 
letics, and so on are highly correlated: wealthier schools tend to spend more on everything, and poorer 
schools spend less on everything. Not surprisingly, it can be difficult to estimate the effect of any 
particular expenditure category on student performance when there is little variation in one category 
that cannot largely be explained by variations in the other expenditure categories (this leads to high $R_j^2$
for each of the expenditure variables). Such multicollinearity problems can be mitigated by collecting 
more data, but in a sense we have imposed the problem on ourselves: we are asking questions that 
may be too subtle for the available data to answer with any precision. We can probably do much bet- 
ter by changing the scope of the analysis and lumping all expenditure categories together, because we 
would no longer be trying to estimate the partial effect of each separate category. 

Another important point is that a high degree of correlation between certain independent vari- 
ables can be irrelevant as to how well we can estimate other parameters in the model. For example, 
consider a model with three independent variables: 


$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + u,$$

where $x_2$ and $x_3$ are highly correlated. Then $\mathrm{Var}(\hat{\beta}_2)$ and $\mathrm{Var}(\hat{\beta}_3)$ may be large. But the amount
of correlation between $x_2$ and $x_3$ has no direct effect on $\mathrm{Var}(\hat{\beta}_1)$. In fact, if $x_1$ is uncorrelated with $x_2$
and $x_3$, then $R_1^2 = 0$ and $\mathrm{Var}(\hat{\beta}_1) = \sigma^2/\mathrm{SST}_1$, regardless of how much correlation there is between
$x_2$ and $x_3$. If $\beta_1$ is the parameter of interest, we do not really care about the amount of correlation
between $x_2$ and $x_3$.

GOING FURTHER 3.4

Suppose you postulate a model explaining final exam score in terms of class attendance. Thus, the
dependent variable is final exam score, and the key explanatory variable is number of classes attended.
To control for student abilities and efforts outside the classroom, you include among the explanatory
variables cumulative GPA, SAT score, and measures of high school performance. Someone says, “You
cannot hope to learn anything from this exercise because cumulative GPA, SAT score, and high school
performance are likely to be highly collinear.” What should be your response?

The previous observation is important because economists often include many control variables
in order to isolate the causal effect of a particular variable. For example, in looking at the relationship
between loan approval rates and percentage of minorities in a neighborhood, we might include
variables like average income, average housing value, measures of creditworthiness, and so on,
because these factors need to be accounted for in order to draw causal conclusions about discrimination.
Income, housing prices, and creditworthiness are generally highly correlated with each other.
But high correlations among these controls do not make it more difficult to determine the effects
of discrimination.

Some researchers find it useful to compute statistics intended to determine the severity of mul- 
ticollinearity in a given application. Unfortunately, it is easy to misuse such statistics because, as we 
have discussed, we cannot specify how much correlation among explanatory variables is “too much.” 
Some multicollinearity “diagnostics” are omnibus statistics in the sense that they detect a strong linear 
relationship among any subset of explanatory variables. For reasons that we just saw, such statistics 
are of questionable value because they might reveal a “problem” simply because two control variables, 
whose coefficients we do not care about, are highly correlated. [Probably the most common omnibus 
multicollinearity statistic is the so-called condition number, which is defined in terms of the full data 
matrix and is beyond the scope of this text. See, for example, Belsley, Kuh, and Welsch (1980).]

Somewhat more useful, but still prone to misuse, are statistics for individual coefficients. The
most common of these is the variance inflation factor (VIF), which is obtained directly from equation
(3.51). The VIF for slope coefficient $j$ is simply $\mathrm{VIF}_j = 1/(1 - R_j^2)$, precisely the term in $\mathrm{Var}(\hat{\beta}_j)$
that is determined by correlation between $x_j$ and the other explanatory variables. We can write $\mathrm{Var}(\hat{\beta}_j)$
in equation (3.51) as

$$\mathrm{Var}(\hat{\beta}_j) = \frac{\sigma^2}{\mathrm{SST}_j} \cdot \mathrm{VIF}_j,$$

which shows that $\mathrm{VIF}_j$ is the factor by which $\mathrm{Var}(\hat{\beta}_j)$ is higher because $x_j$ is not uncorrelated with
the other explanatory variables. Because $\mathrm{VIF}_j$ is a function of $R_j^2$ (indeed, Figure 3.1 is essentially a
graph of $\mathrm{VIF}_1$), our previous discussion can be cast entirely in terms of the VIF. For example, if we
had the choice, we would like $\mathrm{VIF}_j$ to be smaller (other things equal). But we rarely have the choice.
If we think certain explanatory variables need to be included in a regression to infer causality of $x_j$,
then we are hesitant to drop them, and whether we think $\mathrm{VIF}_j$ is “too high” cannot really affect that
decision. If, say, our main interest is in the causal effect of $x_1$ on $y$, then we should ignore entirely the
VIFs of other coefficients. Finally, setting a cutoff value for VIF above which we conclude multicollinearity
is a “problem” is arbitrary and not especially helpful. Sometimes the value 10 is chosen: if
$\mathrm{VIF}_j$ is above 10 (equivalently, $R_j^2$ is above .9), then we conclude that multicollinearity is a “problem”
for estimating $\beta_j$. But a $\mathrm{VIF}_j$ above 10 does not mean that the standard deviation of $\hat{\beta}_j$ is too large to be
useful because the standard deviation also depends on $\sigma$ and $\mathrm{SST}_j$, and the latter can be increased by
increasing the sample size. Therefore, just as with looking at the size of $R_j^2$ directly, looking at the size
of $\mathrm{VIF}_j$ is of limited use, although one might want to do so out of curiosity.
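
The VIF calculation can be sketched numerically. The following is a minimal illustration in Python with NumPy, not a procedure taken from the text: the helper names `r_squared` and `vif`, the simulated regressors, and the choice $\sigma^2 = 1$ are all assumptions made for the example. It computes each $\mathrm{VIF}_j$ from the auxiliary regression of $x_j$ on the other regressors and checks that $(\sigma^2/\mathrm{SST}_j) \cdot \mathrm{VIF}_j$ agrees with the corresponding diagonal element of $\sigma^2 (Z'Z)^{-1}$, where $Z$ is the regressor matrix including a constant.

```python
import numpy as np

def r_squared(y, X):
    """R-squared from an OLS regression of y on X (X must contain a constant column)."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    sst = np.sum((y - y.mean()) ** 2)
    return 1.0 - (resid @ resid) / sst

def vif(X):
    """VIF_j = 1 / (1 - R_j^2) for each column of X (X has no constant column)."""
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        # Auxiliary regression of x_j on a constant and all other regressors.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        out[j] = 1.0 / (1.0 - r_squared(X[:, j], others))
    return out

# Simulated regressors (an assumption for illustration): x2 and x3 are highly
# correlated with each other, while x1 is generated independently of both.
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x2 + 0.1 * rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

vifs = vif(X)

# Compare (sigma^2 / SST_j) * VIF_j with the matrix formula, setting sigma^2 = 1.
Z = np.column_stack([np.ones(n), X])
var_matrix = np.diag(np.linalg.inv(Z.T @ Z))[1:]   # drop the intercept entry
sst = np.sum((X - X.mean(axis=0)) ** 2, axis=0)
var_vif = vifs / sst
```

In this design the VIFs for $x_2$ and $x_3$ are large, but the VIF for $x_1$ stays close to one, which matches the earlier point that correlation among the other regressors has no direct effect on $\mathrm{Var}(\hat{\beta}_1)$.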


3-4b Variances in Misspecified Models 


The choice of whether to include a particular variable in a regression model can be made by analyzing 
the tradeoff between bias and variance. In Section 3-3, we derived the bias induced by leaving out a 
relevant variable when the true model contains two explanatory variables. We continue the analysis of 
this model by comparing the variances of the OLS estimators. 

Write the true population model, which satisfies the Gauss-Markov assumptions, as 


$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u.$$

We consider two estimators of $\beta_1$. The estimator $\hat{\beta}_1$ comes from the multiple regression

$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2.$$ [3.52]

In other words, we include $x_2$, along with $x_1$, in the regression model. The estimator $\tilde{\beta}_1$ is obtained by
omitting $x_2$ from the model and running a simple regression of $y$ on $x_1$:

$$\tilde{y} = \tilde{\beta}_0 + \tilde{\beta}_1 x_1.$$ [3.53]

When $\beta_2 \neq 0$, equation (3.53) excludes a relevant variable from the model and, as we saw in
Section 3-3, this induces a bias in $\tilde{\beta}_1$ unless $x_1$ and $x_2$ are uncorrelated. On the other hand, $\hat{\beta}_1$ is unbiased
for $\beta_1$ for any value of $\beta_2$, including $\beta_2 = 0$. It follows that, if bias is used as the only criterion,
$\hat{\beta}_1$ is preferred to $\tilde{\beta}_1$.

The conclusion that $\hat{\beta}_1$ is always preferred to $\tilde{\beta}_1$ does not carry over when we bring variance into
the picture. Conditioning on the values of $x_1$ and $x_2$ in the sample, we have, from (3.51),

$$\mathrm{Var}(\hat{\beta}_1) = \sigma^2/[\mathrm{SST}_1(1 - R_1^2)],$$ [3.54]

where $\mathrm{SST}_1$ is the total variation in $x_1$, and $R_1^2$ is the R-squared from the regression of $x_1$ on $x_2$. Further,
a simple modification of the proof in Chapter 2 for two-variable regression shows that

$$\mathrm{Var}(\tilde{\beta}_1) = \sigma^2/\mathrm{SST}_1.$$ [3.55]

Comparing (3.55) to (3.54) shows that $\mathrm{Var}(\tilde{\beta}_1)$ is always smaller than $\mathrm{Var}(\hat{\beta}_1)$, unless $x_1$ and $x_2$ are
uncorrelated in the sample, in which case the two estimators $\tilde{\beta}_1$ and $\hat{\beta}_1$ are the same. Assuming that $x_1$
and $x_2$ are not uncorrelated, we can draw the following conclusions:

1. When $\beta_2 \neq 0$, $\tilde{\beta}_1$ is biased, $\hat{\beta}_1$ is unbiased, and $\mathrm{Var}(\tilde{\beta}_1) < \mathrm{Var}(\hat{\beta}_1)$.
2. When $\beta_2 = 0$, $\tilde{\beta}_1$ and $\hat{\beta}_1$ are both unbiased, and $\mathrm{Var}(\tilde{\beta}_1) < \mathrm{Var}(\hat{\beta}_1)$.

From the second conclusion, it is clear that $\tilde{\beta}_1$ is preferred if $\beta_2 = 0$. Intuitively, if $x_2$ does not have a
partial effect on $y$, then including it in the model can only exacerbate the multicollinearity problem,
which leads to a less efficient estimator of $\beta_1$. A higher variance for the estimator of $\beta_1$ is the cost of
including an irrelevant variable in a model.

The case where $\beta_2 \neq 0$ is more difficult. Leaving $x_2$ out of the model results in a biased estimator
of $\beta_1$. Traditionally, econometricians have suggested comparing the likely size of the bias due to omitting
$x_2$ with the reduction in the variance (summarized in the size of $R_1^2$) to decide whether $x_2$ should
be included. However, when $\beta_2 \neq 0$, there are two favorable reasons for including $x_2$ in the model.
The most important of these is that any bias in $\tilde{\beta}_1$ does not shrink as the sample size grows; in fact,
the bias does not necessarily follow any pattern. Therefore, we can usefully think of the bias as being
roughly the same for any sample size. On the other hand, $\mathrm{Var}(\tilde{\beta}_1)$ and $\mathrm{Var}(\hat{\beta}_1)$ both shrink to zero as
$n$ gets large, which means that the multicollinearity induced by adding $x_2$ becomes less important as
the sample size grows. In large samples, we would prefer $\hat{\beta}_1$.

The other reason for favoring $\hat{\beta}_1$ is more subtle. The variance formula in (3.55) is conditional on
the values of $x_{i1}$ and $x_{i2}$ in the sample, which provides the best scenario for $\tilde{\beta}_1$. When $\beta_2 \neq 0$, the variance
of $\tilde{\beta}_1$ conditional only on $x_1$ is larger than that presented in (3.55). Intuitively, when $\beta_2 \neq 0$ and
$x_2$ is excluded from the model, the error variance increases because the error effectively contains part
of $x_2$. But the expression in equation (3.55) ignores the increase in the error variance because it will
treat both regressors as nonrandom. For practical purposes, the $\sigma^2$ term in equation (3.55) increases
when $x_2$ is dropped from the equation. A full discussion of the proper conditioning argument when
computing the OLS variances would lead us too far astray. Suffice it to say that equation (3.55) is too
generous when it comes to measuring the precision of $\tilde{\beta}_1$. Fortunately, statistical packages report the
proper variance estimator, and so we need not worry about the subtleties in the theoretical formulas.
After reading the next subsection, you might want to study Problems 14 and 15 for further insight.
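
The bias and variance comparison in this subsection can be checked with a small simulation. The sketch below is illustrative only: the parameter values, sample size, and number of replications are assumptions. It holds $x_1$ and $x_2$ fixed and redraws only the errors, so it corresponds to the case in which we condition on both regressors, as in (3.54) and (3.55). It does not address the more subtle point about conditioning only on $x_1$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 100, 5000
beta0, beta1, beta2, sigma = 1.0, 2.0, 0.5, 1.0   # assumed population values

# Regressors are drawn once and then held fixed, so every result below is
# conditional on the sample values of x1 and x2.
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)

d1 = x1 - x1.mean()
sst1 = d1 @ d1
delta1 = (d1 @ x2) / sst1                    # slope from regressing x2 on x1
r2_1 = np.corrcoef(x1, x2)[0, 1] ** 2        # R-squared from regressing x1 on x2

var_hat = sigma**2 / (sst1 * (1.0 - r2_1))   # equation (3.54)
var_tilde = sigma**2 / sst1                  # equation (3.55)
bias_tilde = beta2 * delta1                  # omitted variable bias (Section 3-3)

# Monte Carlo check: redraw only the errors, one row of Y per replication.
U = sigma * rng.normal(size=(reps, n))
Y = beta0 + beta1 * x1 + beta2 * x2 + U
Z = np.column_stack([np.ones(n), x1, x2])
b_hat = np.linalg.solve(Z.T @ Z, Z.T @ Y.T)[1]   # multiple regression slope on x1
b_tilde = (Y @ d1) / sst1                        # simple regression slope on x1
```

Across replications, the average of `b_tilde` should be close to `beta1 + bias_tilde` and its variance close to `var_tilde`, while `b_hat` should center on `beta1` with the larger variance `var_hat`.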


3-4c Estimating $\sigma^2$: Standard Errors of the OLS Estimators


We now show how to choose an unbiased estimator of $\sigma^2$, which then allows us to obtain unbiased
estimators of $\mathrm{Var}(\hat{\beta}_j)$.

Because $\sigma^2 = E(u^2)$, an unbiased “estimator” of $\sigma^2$ is the sample average of the squared errors:
$n^{-1}\sum_{i=1}^{n} u_i^2$. Unfortunately, this is not a true estimator because we do not observe the $u_i$. Nevertheless,
recall that the errors can be written as $u_i = y_i - \beta_0 - \beta_1 x_{i1} - \beta_2 x_{i2} - \cdots - \beta_k x_{ik}$, and so the reason
we do not observe the $u_i$ is that we do not know the $\beta_j$. When we replace each $\beta_j$ with its OLS estimator,
we get the OLS residuals:

$$\hat{u}_i = y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \hat{\beta}_2 x_{i2} - \cdots - \hat{\beta}_k x_{ik}.$$

It seems natural to estimate $\sigma^2$ by replacing $u_i$ with the $\hat{u}_i$. In the simple regression case, we saw
that this leads to a biased estimator. The unbiased estimator of $\sigma^2$ in the general multiple regression
case is

$$\hat{\sigma}^2 = \Big(\sum_{i=1}^{n} \hat{u}_i^2\Big)\Big/(n - k - 1) = \mathrm{SSR}/(n - k - 1).$$ [3.56]

We already encountered this estimator in the $k = 1$ case in simple regression.

The term n — k — 1 in (3.56) is the degrees of freedom (df) for the general OLS problem with n 
observations and k independent variables. Because there are k + 1 parameters in a regression model 
with k independent variables and an intercept, we can write 


$$df = n - (k + 1) = (\text{number of observations}) - (\text{number of estimated parameters}).$$ [3.57]


This is the easiest way to compute the degrees of freedom in a particular application: count the num- 
ber of parameters, including the intercept, and subtract this amount from the number of observations. 
(In the rare case that an intercept is not estimated, the number of parameters decreases by one.) 

Technically, the division by $n - k - 1$ in (3.56) comes from the fact that the expected value of the
sum of squared residuals is $E(\mathrm{SSR}) = (n - k - 1)\sigma^2$. Intuitively, we can figure out why the degrees
of freedom adjustment is necessary by returning to the first order conditions for the OLS estimators.
These can be written $\sum_{i=1}^{n} \hat{u}_i = 0$ and $\sum_{i=1}^{n} x_{ij}\hat{u}_i = 0$, where $j = 1, 2, \ldots, k$. Thus, in obtaining the
OLS estimates, $k + 1$ restrictions are imposed on the OLS residuals. This means that, given $n - (k + 1)$
of the residuals, the remaining $k + 1$ residuals are known: there are only $n - (k + 1)$ degrees of freedom
in the residuals. (This can be contrasted with the errors $u_i$, which have $n$ degrees of freedom in the sample.)

For reference, we summarize this discussion with Theorem 3.3. We proved this theorem for the 
case of simple regression analysis in Chapter 2 (see Theorem 2.3). (A general proof that requires 
matrix algebra is provided in Advanced Treatment E.) 


THEOREM 3.3 UNBIASED ESTIMATION OF $\sigma^2$


Under the Gauss-Markov assumptions MLR.1 through MLR.5, $E(\hat{\sigma}^2) = \sigma^2$.


The positive square root of $\hat{\sigma}^2$, denoted $\hat{\sigma}$, is called the standard error of the regression (SER). The
SER is an estimator of the standard deviation of the error term. This estimate is usually reported by
regression packages, although it is called different things by different packages. (In addition to SER,
$\hat{\sigma}$ is also called the standard error of the estimate and the root mean squared error.)

Note that $\hat{\sigma}$ can either decrease or increase when another independent variable is added to a
regression (for a given sample). This is because, although SSR must fall when another explanatory 
variable is added, the degrees of freedom also falls by one. Because SSR is in the numerator and df is 
in the denominator, we cannot tell beforehand which effect will dominate. 
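
A short numerical sketch may help fix the definitions in (3.56) and (3.57). The simulated data and coefficient values below are assumptions made only for illustration. The code computes the OLS residuals, verifies that they satisfy the $k + 1$ first order conditions, and then forms the degrees of freedom, $\hat{\sigma}^2$, and the SER.

```python
import numpy as np

# Simulated sample (illustrative assumption): n observations, k regressors.
rng = np.random.default_rng(2)
n, k = 50, 3
X = rng.normal(size=(n, k))
y = 1.0 + X @ np.array([0.5, -1.0, 2.0]) + rng.normal(size=n)

Z = np.column_stack([np.ones(n), X])            # add the intercept column
beta_hat, *_ = np.linalg.lstsq(Z, y, rcond=None)
resid = y - Z @ beta_hat                        # OLS residuals

ssr = resid @ resid                             # sum of squared residuals
df = n - (k + 1)                                # equation (3.57)
sigma2_hat = ssr / df                           # equation (3.56)
ser = np.sqrt(sigma2_hat)                       # standard error of the regression

# First order conditions: the residuals sum to zero and are orthogonal to each
# regressor, which gives the k + 1 restrictions behind the df adjustment.
foc = Z.T @ resid                               # every entry is zero up to rounding
```

Because `df` is smaller than `n`, `sigma2_hat` is always larger than the naive average `ssr / n`, which is the direction of the correction needed for unbiasedness.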

For constructing confidence intervals and conducting tests in Chapter 4, we will need to estimate
the standard deviation of $\hat{\beta}_j$, which is just the square root of the variance:

$$\mathrm{sd}(\hat{\beta}_j) = \sigma/[\mathrm{SST}_j(1 - R_j^2)]^{1/2}.$$

Because $\sigma$ is unknown, we replace it with its estimator, $\hat{\sigma}$. This gives us the standard error of $\hat{\beta}_j$:

$$\mathrm{se}(\hat{\beta}_j) = \hat{\sigma}/[\mathrm{SST}_j(1 - R_j^2)]^{1/2}.$$ [3.58]

Just as the OLS estimates can be obtained for any given sample, so can the standard errors. Because
$\mathrm{se}(\hat{\beta}_j)$ depends on $\hat{\sigma}$, the standard error has a sampling distribution, which will play a role in Chapter 4.

We should emphasize one thing about standard errors. Because (3.58) is obtained directly from 
the variance formula in (3.51), and because (3.51) relies on the homoskedasticity Assumption MLR.5, 
it follows that the standard error formula in (3.58) is not a valid estimator of $\mathrm{sd}(\hat{\beta}_j)$ if the errors
exhibit heteroskedasticity. Thus, while the presence of heteroskedasticity does not cause bias in the $\hat{\beta}_j$,
it does lead to bias in the usual formula for $\mathrm{Var}(\hat{\beta}_j)$, which then invalidates the standard errors. This is
important because any regression package computes (3.58) as the default standard error for each coef- 
ficient (with a somewhat different representation for the intercept). If we suspect heteroskedasticity, 
then the “usual” OLS standard errors are invalid, and some corrective action should be taken. We will 
see in Chapter 8 what methods are available for dealing with heteroskedasticity. 

For some purposes it is helpful to write 


$$\mathrm{se}(\hat{\beta}_j) = \frac{\hat{\sigma}}{\sqrt{n}\,\mathrm{sd}(x_j)\sqrt{1 - R_j^2}},$$ [3.59]

in which we take $\mathrm{sd}(x_j) = \sqrt{n^{-1}\sum_{i=1}^{n}(x_{ij} - \bar{x}_j)^2}$ to be the sample standard deviation where the total
sum of squares is divided by $n$ rather than $n - 1$. The importance of equation (3.59) is that it shows
how the sample size, $n$, directly affects the standard errors. The other three terms in the formula ($\hat{\sigma}$,
$\mathrm{sd}(x_j)$, and $R_j^2$) will change with different samples, but as $n$ gets large they settle down to constants.
Therefore, we can see from equation (3.59) that the standard errors shrink to zero at the rate $1/\sqrt{n}$.
This formula demonstrates the value of getting more data: the precision of the $\hat{\beta}_j$ increases as $n$
increases. (By contrast, recall that unbiasedness holds for any sample size subject to being able to 
compute the estimators.) We will talk more about large sample properties of OLS in Chapter 5. 
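
As a rough check on (3.58) and (3.59), the sketch below computes the slope standard errors three ways: from (3.58), from (3.59), and from the matrix expression $\hat{\sigma}\sqrt{[(Z'Z)^{-1}]_{jj}}$ that regression software typically uses. The function name `ols_standard_errors` and the simulated data are assumptions for illustration. The last lines compare a subsample of 400 observations with the full sample of 1,600 to show the $1/\sqrt{n}$ rate.

```python
import numpy as np

def ols_standard_errors(X, y):
    """Slope standard errors from (3.58), (3.59), and the matrix formula."""
    n, k = X.shape
    Z = np.column_stack([np.ones(n), X])
    beta_hat, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta_hat
    sigma_hat = np.sqrt((resid @ resid) / (n - k - 1))

    se_358 = np.empty(k)
    se_359 = np.empty(k)
    for j in range(k):
        xj = X[:, j]
        # R_j^2 from regressing x_j on a constant and the other regressors.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        g, *_ = np.linalg.lstsq(others, xj, rcond=None)
        rj = xj - others @ g
        sst_j = np.sum((xj - xj.mean()) ** 2)
        r2_j = 1.0 - (rj @ rj) / sst_j
        se_358[j] = sigma_hat / np.sqrt(sst_j * (1.0 - r2_j))
        sd_xj = np.sqrt(sst_j / n)               # divides by n, not n - 1
        se_359[j] = sigma_hat / (np.sqrt(n) * sd_xj * np.sqrt(1.0 - r2_j))

    se_matrix = sigma_hat * np.sqrt(np.diag(np.linalg.inv(Z.T @ Z))[1:])
    return se_358, se_359, se_matrix

# Simulated data (illustrative assumption) with two correlated regressors.
rng = np.random.default_rng(5)
n_big = 1600
X = rng.normal(size=(n_big, 2))
X[:, 1] += 0.5 * X[:, 0]
y = 1.0 + X @ np.array([1.0, -0.5]) + rng.normal(size=n_big)

se_small = ols_standard_errors(X[:400], y[:400])
se_large = ols_standard_errors(X, y)
ratio = se_small[0] / se_large[0]    # roughly sqrt(1600 / 400) = 2 for each slope
```

The three formulas agree up to rounding error because $\mathrm{SST}_j = n \cdot \mathrm{sd}(x_j)^2$, so (3.59) is only a rearrangement of (3.58).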


3-5 Efficiency of OLS: The Gauss-Markov Theorem 


In this section, we state and discuss the important Gauss-Markov Theorem, which justifies the use 
of the OLS method rather than using a variety of competing estimators. We know one justification for 
OLS already: under Assumptions MLR.1 through MLR.4, OLS is unbiased. However, there are many 
unbiased estimators of the $\beta_j$ under these assumptions (for example, see Problem 13). Might there be
other unbiased estimators with variances smaller than the OLS estimators? 

If we limit the class of competing estimators appropriately, then we can show that OLS is best 
within this class. Specifically, we will argue that, under Assumptions MLR.1 through MLR.5, the 
OLS estimator $\hat{\beta}_j$ for $\beta_j$ is the best linear unbiased estimator (BLUE). To state the theorem, we
need to understand each component of the acronym “BLUE.” First, we know what an estimator is:
it is a rule that can be applied to any sample of data to produce an estimate. We also know what an
unbiased estimator is: in the current context, an estimator, say, $\tilde{\beta}_j$, of $\beta_j$ is an unbiased estimator of $\beta_j$
if $E(\tilde{\beta}_j) = \beta_j$ for any $\beta_0, \beta_1, \ldots, \beta_k$.

What about the meaning of the term “linear”? In the current context, an estimator $\tilde{\beta}_j$ of $\beta_j$ is linear
if, and only if, it can be expressed as a linear function of the data on the dependent variable:

$$\tilde{\beta}_j = \sum_{i=1}^{n} w_{ij} y_i,$$ [3.60]

where each $w_{ij}$ can be a function of the sample values of all the independent variables. The OLS
estimators are linear, as can be seen from equation (3.22).

Finally, how do we define “best”? For the current theorem, best is defined as having the smallest 
variance. Given two unbiased estimators, it is logical to prefer the one with the smallest variance (see
Math Refresher C).

Now, let $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_k$ denote the OLS estimators in model (3.31) under Assumptions MLR.1
through MLR.5. The Gauss-Markov Theorem says that, for any estimator $\tilde{\beta}_j$ that is linear and unbiased,
$\mathrm{Var}(\hat{\beta}_j) \le \mathrm{Var}(\tilde{\beta}_j)$, and the inequality is usually strict. In other words, in the class of linear
unbiased estimators, OLS has the smallest variance (under the five Gauss-Markov assumptions).
Actually, the theorem says more than this. If we want to estimate any linear function of the $\beta_j$, then
the corresponding linear combination of the OLS estimators achieves the smallest variance among 
all linear unbiased estimators. We conclude with a theorem, which is proven in Appendix 3A. It is 
because of this theorem that Assumptions MLR.1 through MLR.5 are known as the Gauss-Markov 
assumptions (for cross-sectional analysis). 


THEOREM 3.4 GAUSS-MARKOV THEOREM


Under Assumptions MLR.1 through MLR.5, $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_k$ are the best linear unbiased estimators
(BLUEs) of $\beta_0, \beta_1, \ldots, \beta_k$, respectively.


The importance of the Gauss-Markov Theorem is that, when the standard set of assumptions 
holds, we need not look for alternative unbiased estimators of the form in (3.60): none will be better 
than OLS. Equivalently, if we are presented with an estimator that is both linear and unbiased, then 
we know that the variance of this estimator is at least as large as the OLS variance; no additional cal- 
culation is needed to show this. 

For our purposes, Theorem 3.4 justifies the use of OLS to estimate multiple regression models. If 
any of the Gauss-Markov assumptions fail, then this theorem no longer holds. We already know that 
failure of the zero conditional mean assumption (Assumption MLR.4) causes OLS to be biased, so 
Theorem 3.4 also fails. We also know that heteroskedasticity (failure of Assumption MLR.5) does not 
cause OLS to be biased. However, OLS no longer has the smallest variance among linear unbiased 
estimators in the presence of heteroskedasticity. In Chapter 8, we analyze an estimator that improves 
upon OLS when we know the brand of heteroskedasticity. 
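
To make the theorem concrete, consider the simple regression case and a competing estimator of the form (3.60). The sketch below is an illustration under stated assumptions, not part of the proof: the helper `weights`, the choice $z_i = x_i^3$, and the simulated $x_i$ are all hypothetical. Any weights of the form $w_i = (z_i - \bar{z})/\sum_{i}(z_i - \bar{z})x_i$, with $z_i$ a function of $x_i$, satisfy $\sum_i w_i = 0$ and $\sum_i w_i x_i = 1$, so $\sum_i w_i y_i$ is linear and unbiased for $\beta_1$. Under homoskedasticity its conditional variance is $\sigma^2 \sum_i w_i^2$.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
x = rng.normal(size=n)     # simulated regressor values, held fixed
sigma2 = 1.0               # assumed error variance

def weights(z, x):
    """Weights w_i = (z_i - zbar) / sum((z_i - zbar) * x_i).

    They satisfy sum(w) = 0 and sum(w * x) = 1, so sum(w * y) is a linear
    unbiased estimator of the slope in a simple regression of y on x.
    """
    dz = z - z.mean()
    return dz / (dz @ x)

w_ols = weights(x, x)        # z = x reproduces the OLS slope weights
w_alt = weights(x ** 3, x)   # z = x^3 gives a competing linear unbiased estimator

var_ols = sigma2 * (w_ols @ w_ols)   # equals sigma^2 / SST_x
var_alt = sigma2 * (w_alt @ w_alt)   # never smaller than var_ols
```

The comparison requires no simulation of $y$: given the $x_i$, both variances are known functions of the weights, and the Gauss-Markov Theorem guarantees that `var_alt` cannot fall below `var_ols`.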


3-6 Some Comments on the Language of Multiple 
Regression Analysis 


It is common for beginners, and not unheard of for experienced empirical researchers, to report that 
they “estimated an OLS model.” Although we can usually figure out what someone means by this 
statement, it is important to understand that it is wrong—on more than just an aesthetic level—and 
reflects a misunderstanding about the components of a multiple regression analysis. 

The first thing to remember is that ordinary least squares (OLS) is an estimation method, not a 
model. A model describes an underlying population and depends on unknown parameters. The linear 
model that we have been studying in this chapter can be written—in the population—as 


$$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + u,$$ [3.61]


where the parameters are the $\beta_j$. Importantly, we can talk about the meaning of the $\beta_j$ without ever
looking at data. It is true we cannot hope to learn much about the $\beta_j$ without data, but the interpretation
of the $\beta_j$ is obtained from the linear model in equation (3.61).

Once we have a sample of data we can estimate the parameters. Although it is true that we have 
so far only discussed OLS as a possibility, there are actually many more ways to use the data than we 
can even list. We have focused on OLS due to its widespread use, which is justified by using the statis- 
tical considerations we covered previously in this chapter. But the various justifications for OLS rely 
on the assumptions we have made (MLR.1 through MLR.5). As we will see in later chapters, under 
different assumptions different estimation methods are preferred, even though our model can still be
represented by equation (3.61). Just a few examples include weighted least squares in Chapter 8, least 
absolute deviations in Chapter 9, and instrumental variables in Chapter 15. 

One might argue that the discussion here is overly pedantic, and that the phrase “estimating an 
OLS model” should be taken as a useful shorthand for “I estimated a linear model by OLS.” This 
stance has some merit, but we must remember that we have studied the properties of the OLS estima- 
tors under different assumptions. For example, we know OLS is unbiased under the first four Gauss- 
Markov assumptions, but it has no special efficiency properties without Assumption MLR.5. We have 
also seen, through the study of the omitted variables problem, that OLS is biased if we do not have 
Assumption MLR.4. The problem with using imprecise language is that it leads to vagueness on 
the most important considerations: what assumptions are being made on the underlying linear model? 
The issue of the assumptions we are using is conceptually different from the estimator we wind up 
applying. 

Ideally, one writes down an equation like (3.61), with variable names that are easy to decipher, 
such as 


$$\mathit{math4} = \beta_0 + \beta_1 \mathit{classize4} + \beta_2 \mathit{math3} + \beta_3 \log(\mathit{income}) + \beta_4 \mathit{motheduc} + \beta_5 \mathit{fatheduc} + u$$ [3.62]


if we are trying to explain outcomes on a fourth-grade math test. Then, in the context of equation 
(3.62), one includes a discussion of whether it is reasonable to maintain Assumption MLR.4, focus- 
ing on the factors that might still be in u and whether more complicated functional relationships are 
needed (a topic we study in detail in Chapter 6). Next, one describes the data source (which ideally 
is obtained via random sampling) as well as the OLS estimates obtained from the sample. A proper 
way to introduce a discussion of the estimates is to say “I estimated equation (3.62) by ordinary least 
squares. Under the assumption that no important variables have been omitted from the equation, and 
assuming random sampling, the OLS estimator of the class size effect, $\hat{\beta}_1$, is unbiased. If the error
term u has constant variance, the OLS estimator is actually best linear unbiased.” As we will see in 
Chapters 4 and 5, we can often say even more about OLS. Of course, one might want to admit that 
while controlling for third-grade math score, family income and parents’ education might account for 
important differences across students, it might not be enough—for example, u can include motivation 
of the student or parents—in which case OLS might be biased. 

A more subtle reason for being careful in distinguishing between an underlying population model 
and an estimation method used to estimate a model is that estimation methods such as OLS can be 
used essentially as an exercise in curve fitting or prediction, without explicitly worrying about an 
underlying model and the usual statistical properties of unbiasedness and efficiency. For example, we 
might just want to use OLS to estimate a line that allows us to predict future college GPA for a set of 
high school students with given characteristics. 


3-7 Several Scenarios for Applying Multiple Regression 


Now that we have covered the algebraic and statistical properties of OLS, it is a good time to catalog 
different scenarios where unbiasedness of OLS can be established. In particular, we are interested in 
situations verifying Assumptions MLR.1 and MLR.4, as these are the important population assump- 
tions. Assumption MLR.2 concerns the sampling scheme, and the mild restrictions in Assumption 
MLR.3 are rarely a concern. 

The linearity assumption in MLR.1, where the error term u is additive, is always subject to criti- 
cism, although we know that it is not nearly as restrictive as it might seem because we can use trans- 
formations of both the explained and explanatory variables. Plus, the linear model is always a good 
starting point, and often provides a suitable approximation. In any case, for the purposes of the fol- 
lowing discussion the functional form issue is not critical. 
3-7a Prediction


As suggested at the end of Section 3-6, sometimes we are interested in a pure prediction exercise, 
where we hope to predict the outcome of a variable, $y$, given a set of observed variables, $x_1, x_2, \ldots, x_k$.
To continue the example previously mentioned, a college admissions officer might want to predict the
success of applicants (as measured by, say, future college GPA, $y$) based on information available
at the time of application. These variables, which include performance variables from high school
(GPA, kinds of classes taken, standardized test scores) and possibly family background, comprise
the explanatory variables. As described in Math Refresher B, the best predictor of $y$, as measured by
mean squared error, is the conditional expectation, $E(y|x_1, \ldots, x_k)$. If we assume a linear function for
the conditional expectation then

$$E(y|x_1, \ldots, x_k) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k,$$

which is the same as writing

$$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + u$$
$$E(u|x_1, \ldots, x_k) = 0.$$


In other words, MLR.4 is true by construction once we assume linearity. If we have a random sample 
on the $x_j$ and $y$ and we can rule out perfect collinearity, we can obtain unbiased estimators of the $\beta_j$ by
OLS. In the example of predicting future GPAs, we would obtain a sample of students who attended 
the university so that we can observe their college GPAs. (Whether this provides a random sample 
from the relevant population is an interesting question, but too advanced to discuss here. We do so in 
Chapters 9 and 17.) 

By estimating the $\beta_j$ we can also see which factors are most important for predicting future college
success, as a way to fine-tune our prediction model. But we do not yet have a formal way of
choosing which explanatory variables to include; that comes in the next chapter. 


3-7b Efficient Markets 


Efficient markets theories in economics, as well as some other theories, often imply that a single 
variable acts as a “sufficient statistic” for predicting the outcome variable, y. For emphasis, call this 
special predictor $w$. Then, given other observed factors, say $x_1, \ldots, x_k$, we might want to test the
assumption

$$E(y|w, \mathbf{x}) = E(y|w),$$ [3.63]

where $\mathbf{x}$ is a shorthand for $(x_1, x_2, \ldots, x_k)$. We can test (3.63) using a linear model for $E(y|w, \mathbf{x})$:

$$E(y|w, \mathbf{x}) = \beta_0 + \beta_1 w + \gamma_1 x_1 + \cdots + \gamma_k x_k,$$ [3.64]

where the slight change in notation is used to reflect the special status of $w$. In Chapter 4 we will learn
how to test whether all of the $\gamma_j$ are zero:

$$\gamma_1 = \gamma_2 = \cdots = \gamma_k = 0.$$ [3.65]

Many efficient markets theories imply more than just (3.63). In addition, typically

$$E(y|w) = w,$$

which means that, in the linear equation (3.64), $\beta_0 = 0$ and $\beta_1 = 1$. Again, we will learn how to test
such restrictions in Chapter 4.

As a specific example, consider the sports betting market—say, for college football. The gam- 
bling markets produce a point spread, w = spread, which is determined prior to a game being played. 
The spread typically varies a bit during the days preceding the game, but it eventually settles on some 
value. (Typically, the spread is in increments of 0.5.) The actual score differential in the game is
$y = \mathit{scorediff}$. Efficiency of the gambling market implies that

$$E(\mathit{scorediff}|\mathit{spread}, x_1, \ldots, x_k) = E(\mathit{scorediff}|\mathit{spread}) = \mathit{spread},$$

where $x_1, \ldots, x_k$ includes any variables observable to the public prior to the game being played.
Examples include previous winning percentage, where the game is played, known injuries to key
players, and so on. The idea is that, because lots of money is involved in betting, the spread will move
until it incorporates all relevant information. Multiple regression can be used to test the efficient markets
hypothesis because MLR.4 holds by construction once we assume a linear model:

$$y = \beta_0 + \beta_1 w + \gamma_1 x_1 + \cdots + \gamma_k x_k + u$$ [3.66]
$$E(u|w, x_1, \ldots, x_k) = 0,$$ [3.67]

where the explanatory variables are $w, x_1, \ldots, x_k$.

Incidentally, it may be possible to think of a variable to be included among the $x_j$ that the market
has not incorporated into the spread. To be useful, it must be a variable that one can observe prior to
the game being played. Most tests of the efficiency of the gambling markets show that, except for
short aberrations, the market is remarkably efficient.
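
The regression in (3.66) can be set up as follows. This is only a sketch on simulated data, not real betting data: the variable `home`, the noise standard deviations, and the sample size are hypothetical choices, and efficiency is built in by generating `scorediff` as the spread plus unpredictable noise. Formal tests of $\beta_0 = 0$, $\beta_1 = 1$, and $\gamma_1 = 0$ require the tools of Chapter 4, so the code reports point estimates only.

```python
import numpy as np

# Simulated data, not real betting data: an efficient market is imposed by
# construction, so E(scorediff | spread, home) = spread.
rng = np.random.default_rng(4)
n = 2000
spread = np.round(2 * rng.normal(0.0, 10.0, size=n)) / 2   # increments of 0.5
home = rng.integers(0, 2, size=n).astype(float)            # hypothetical public variable
scorediff = spread + rng.normal(0.0, 13.0, size=n)

# Estimate equation (3.66): scorediff = b0 + b1 * spread + g1 * home + u.
Z = np.column_stack([np.ones(n), spread, home])
coef, *_ = np.linalg.lstsq(Z, scorediff, rcond=None)
b0_hat, b1_hat, g1_hat = coef
```

With data generated this way, `b1_hat` should be near one and both `b0_hat` and `g1_hat` near zero. With real data, estimates far from these values, relative to their standard errors, would be evidence against efficiency.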


3-7c Measuring the Tradeoff between Two Variables 


Sometimes regression models are used not to predict, or to determine causality, but to simply measure
how an economic agent trades off one variable for another. Call these variables y and w. For example, 
consider the population of K-12 teachers in a state in the United States. Let y be annual salary and 
w be a measure of pension compensation. If teachers are indifferent between a dollar of salary and a 
dollar of pension, then, on average, a one-dollar increase in pension compensation should be associ- 
ated with a one-dollar fall in salary. In other words, only total compensation matters. Naturally, this is 
a ceteris paribus question: all other relevant factors should be held fixed. In particular, we would expect 
to see a positive correlation between salary and pension benefits because pension benefits are often 
tied to salary. We want to know, for a given teacher, how does that teacher trade off one for the other. 

Because we are simply measuring a tradeoff, it should not matter which variable we choose as 
y and which we choose as w. However, functional form considerations can come into play. (We will 
see this later in Example 4.10 in Chapter 4, where we study the salary-benefits tradeoff using aggregated 
data.) Once we have chosen y and w, and we have controls \(x = (x_1, \ldots, x_k)\), we are, as in 
Section 3-7b, interested in \(E(y \mid w, x)\). Assuming a linear model, we are exactly in the situation given in 
equations (3.66) and (3.67). A key difference is that, assuming the \(x_j\) properly control for differences in 
individuals, the theory of a one-to-one tradeoff is \(\beta_1 = -1\), without restricting the intercept, \(\beta_0\). That 
is quite different from the efficient markets hypothesis. Further, we include the \(x_j\) to control for differences; 
we do not expect the \(\gamma_j\) to be zero, and we would generally have no interest in testing (3.65). 

If we are not able to include sufficient controls in x, then the estimated tradeoff coefficient, \(\hat{\beta}_1\), will 
be biased (although the direction depends on what we think we have omitted). This is tantamount to 
an omitted variable problem. For example, we may not have a suitable measure of teachers’ taste for 
saving or amount of risk aversion. 
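
A small simulation can make the omitted variable point concrete. The sketch below is purely illustrative: the data-generating process, the names `taste`, `pension`, and `salary`, and every numerical value are assumptions made for this example, not taken from the text or from any data set. The true tradeoff is set to -1, and an unobserved factor raises both pension and salary.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical unobserved factor (for example, taste for saving); an assumption.
taste = rng.normal(size=n)
# Pension compensation (w) depends on the unobserved factor.
pension = 10.0 + 2.0 * taste + rng.normal(size=n)
# True one-to-one tradeoff (beta_1 = -1); taste also affects salary directly.
salary = 50.0 - 1.0 * pension + 3.0 * taste + rng.normal(size=n)


def ols(y, *regressors):
    """Return OLS coefficients (intercept first) from regressing y on the regressors."""
    X = np.column_stack([np.ones_like(y), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


b_short = ols(salary, pension)         # omits taste
b_long = ols(salary, pension, taste)   # controls for taste

print("tradeoff estimate, omitting taste:       ", round(b_short[1], 3))
print("tradeoff estimate, controlling for taste:", round(b_long[1], 3))
```

In this assumed design, the regression that omits `taste` gives a slope far from -1 (it is even positive), while the regression that includes it recovers a value close to -1. The direction of the bias here follows from the assumed positive effects of `taste` on both variables; other assumptions would give a different direction.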


3-7d Testing for Ceteris Paribus Group Differences 


Another common application of multiple regression analysis is to test for differences among groups— 
often, groups of people—once we account for other factors. In Section 2-7 we discussed the example 
of estimating differences in hourly wage, wage, based on race, which is divided into white and other. 
To this end, define a binary variable white. In Section 2-7a we noted that finding a difference in aver- 
age wages across whites and nonwhites did not necessarily indicate wage discrimination because 
other factors could contribute to such a difference. 

Let \(x_1, x_2, \ldots, x_k\) denote other observable factors that can affect hourly wage, such as education, 
workforce experience, and so on. Then we are interested in 


\[ E(wage \mid white, x_1, \ldots, x_k). \]


If we have accounted for all factors in wage that should affect productivity, then wage differences by 
race might be attributable to discrimination. In the simplest case, we would use a linear model: 


\[ E(wage \mid white, x_1, \ldots, x_k) = \beta_0 + \beta_1 white + \gamma_1 x_1 + \cdots + \gamma_k x_k \] [3.68] 


where we are primarily interested in the coefficient \(\beta_1\), which measures the difference in average wage 
between whites and nonwhites given the same levels of the control variables, \(x_1, x_2, \ldots, x_k\) (education, 
experience, and so on). For a general y and w, we again have (3.66) and (3.67) in force, and so MLR.4 holds by 
construction. OLS can be used to obtain an unbiased estimator of \(\beta_1\) (and the other coefficients). Problems 
arise when we cannot include all suitable variables among the \(x_j\), in which case, again, we have an 
omitted variable problem. In the case of testing for racial or gender discrimination, failure to control 
for all relevant factors can cause systematic bias in estimating discrepancies due to discrimination. 
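
The difference between a raw group gap and a ceteris paribus gap can be sketched with simulated data. Everything below is hypothetical: the binary indicator `group`, the single control `educ`, and all coefficients are assumptions chosen for illustration. The sketch shows two facts: the simple regression slope on a binary variable equals the raw difference in group means, and adding the control changes the estimated gap when the control differs across groups.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

# Hypothetical binary group indicator and one control; all values are assumptions.
group = rng.integers(0, 2, size=n).astype(float)
educ = 12.0 + 1.5 * group + rng.normal(scale=2.0, size=n)    # control differs by group
wage = 2.0 + 0.5 * group + 0.8 * educ + rng.normal(size=n)   # true beta_1 = 0.5


def ols(y, *regressors):
    """Return OLS coefficients (intercept first) from regressing y on the regressors."""
    X = np.column_stack([np.ones_like(y), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


raw_gap = wage[group == 1].mean() - wage[group == 0].mean()
b_simple = ols(wage, group)           # slope reproduces the raw gap
b_multiple = ols(wage, group, educ)   # gap holding educ fixed

print("raw difference in means:  ", round(raw_gap, 3))
print("simple regression slope:  ", round(b_simple[1], 3))
print("multiple regression slope:", round(b_multiple[1], 3))
```

Under these assumptions the raw gap mixes the direct group effect with the effect of the difference in `educ`, whereas the multiple regression coefficient is close to the assumed value of 0.5.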


3-7e Potential Outcomes, Treatment Effects, and Policy Analysis 


For most practicing economists, the most exciting applications of multiple regression are in trying 
to estimate causal effects of policy interventions. Do job training programs increase labor earnings? 
By how much? Do school choice programs improve student outcomes? Does legalizing marijuana 
increase crime rates? 

We introduced the potential outcomes approach to studying policy questions in Section 2-7a. In 
particular, we studied simple regression in the context of a binary policy intervention, using the notion 
of counterfactual outcomes. In this section we change notation slightly, using w to denote the binary 
intervention or policy indicator. As in Section 2-7a, for each unit in the population we imagine the 
existence of the potential outcomes, y(0) and y(1)—representing different states of the world. If we 
assume a constant treatment effect, say \(\tau\), then we can write, for any unit i, 


\[ y_i(1) = \tau + y_i(0). \]

When the treatment effect can vary by i, the average treatment effect is 

\[ \tau_{ate} = E[y_i(1) - y_i(0)], \] [3.69] 


where the expectation is taken over the entire population. 
For a random draw i, the outcome we observe, \(y_i\), can be written 


\[ y_i = (1 - w_i) y_i(0) + w_i y_i(1). \] [3.70] 


One of the important conclusions from Section 2-7a is that the simple regression of y on w (with 
an intercept, as usual) yields an unbiased estimator of \(\tau_{ate}\) only if we have random assignment of w, that is, 


w is independent of [y(0), y(1)]. 


Random assignment remains rare in business, economics, and other social sciences because true 
experiments are uncommon. Fortunately, if we have control variables (variables that help predict 
the potential outcomes and determine assignment into the treatment and control groups), we 
can use multiple regression. Letting x again denote a set of control variables, consider the following 
assumption: 


w is independent of [y(0), y(1)] conditional on x. [3.71] 


For fairly obvious reasons, this assumption is called conditional independence, where it is important 
to note the variables in x that are in the conditioning set. In the treatment effects literature, 
(3.71) is also called unconfounded assignment or unconfoundedness conditional on x. The terms 
ignorable assignment and ignorability are also used. 

Assumption (3.71) has a simple interpretation. Think of partitioning the population based 
on the observed variables in x. For concreteness, consider the job training program introduced in 
Section 2-7a. There, w indicates whether a worker participates in a job training program, and y is 
an outcome such as labor income. The elements in x include education, age, and past labor market 
history, such as earnings from the previous couple of years. Suppose that workers are more likely to 
participate in the program the lower their education, age, and the worse their previous labor market 
outcomes. Then, because education, age, and prior labor market history are very likely to predict y(0) 
and y(1), random assignment does not hold. Nevertheless, once we group people by education, age, 
and prior work history, it is possible that assignment is random. As a concrete example, consider the 
group of people with 12 years of schooling who are 35 years old and who had average earnings of 
$25,000 the past two years. What (3.71) requires is that within this group, assignment to the treatment 
and control groups is random. 
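
The idea that assignment can be nonrandom overall yet random within groups defined by x can be illustrated with a simulation. This is a minimal sketch under assumed values: a single discrete control `x` with three cells, participation probabilities that fall with `x`, and a constant treatment effect `tau`. None of these numbers come from the text.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
tau = 2.0  # assumed constant treatment effect

# One discrete control with three cells (hypothetical strata).
x = rng.integers(0, 3, size=n)
# Participation is more likely in cells with lower x, but random within a cell.
p_treat = np.array([0.7, 0.4, 0.1])[x]
w = (rng.random(n) < p_treat).astype(int)
# Potential outcomes: y(0) rises with x, and y(1) = y(0) + tau.
y0 = 5.0 + 4.0 * x + rng.normal(size=n)
y1 = y0 + tau
y = (1 - w) * y0 + w * y1  # observed outcome, as in (3.70)

# Comparison that ignores x.
overall_gap = y[w == 1].mean() - y[w == 0].mean()
# Comparisons within each cell of x.
cell_gaps = [
    y[(w == 1) & (x == c)].mean() - y[(w == 0) & (x == c)].mean()
    for c in range(3)
]

print("overall difference in means:", round(overall_gap, 3))
print("within-cell differences:    ", [round(g, 3) for g in cell_gaps])
```

In this design the overall comparison is negative even though the assumed effect is positive, because participants are concentrated in cells with low y(0). Within each cell, where assignment is random by construction, the difference in means is close to `tau`.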

The more variables we observe prior to implementation of the program, the more likely (3.71) 
is to hold. If we observe no information to include in x, then we are back to assuming pure random 
assignment. Of course, it is always possible that we have not included the correct variables in x. 
For example, perhaps everyone in a sample from the eligible population was administered a test to 
measure intelligence, and assignment to the program is partly based on the score from the test. If we 
observe the test score, we include it in x. If we cannot observe the test score, it must be excluded from 
x and (3.71) would generally fail—although it could be “close” to being true if we have other good 
controls in x. 

How can we use (3.71) in multiple regression? Here we only consider the case of a constant treatment 
effect, \(\tau\). Section 7-6 in Chapter 7 considers the more general case. Then, in the population, 


\[ y = y(0) + \tau w \]

and 

\[ E(y \mid w, x) = E[y(0) \mid w, x] + \tau w = E[y(0) \mid x] + \tau w, \] [3.72] 

where the second equality follows from conditional independence. Now assume that \(E[y(0) \mid x]\) 
is linear, 

\[ E[y(0) \mid x] = \alpha + x\gamma. \]

Plugging in gives 

\[ E(y \mid w, x) = \alpha + \tau w + x\gamma = \alpha + \tau w + \gamma_1 x_1 + \cdots + \gamma_k x_k. \] [3.73] 

As in several previous examples in this section, we are interested primarily in the coefficient on w, 
which we have called \(\tau\). The \(\gamma_j\) are of interest for logical consistency checks (for example, we should 
expect more education to lead to higher earnings, on average), but the main role of the \(x_j\) is to control 
for differences across units. 
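
The link between (3.73) and an equation that OLS can estimate can be written out in one more step. The following is a sketch under the constant treatment effect and linearity assumptions already stated; the error term u is introduced here only for illustration and is simply defined as the deviation of y from its conditional mean.

```latex
\begin{aligned}
u &\equiv y - E(y \mid w, x), \\
y &= \alpha + \tau w + \gamma_1 x_1 + \cdots + \gamma_k x_k + u, \\
E(u \mid w, x) &= E(y \mid w, x) - E(y \mid w, x) = 0 .
\end{aligned}
```

So the zero conditional mean assumption holds by construction with explanatory variables (w, x). Given random sampling and no perfect collinearity among w and the elements of x, the OLS coefficient on w from the regression of y on w and x is then unbiased for \(\tau\).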


In Chapter 7 we will cover treatment effects in more generality, including how to use multiple 
regression when treatment effects vary by unit (individual in the job training case). 


Evaluating a Job Training Program 


The data in JTRAIN98 are on male workers and can be used to evaluate a job training program, 
where the variable we would like to explain, y = earn98, is labor market earnings in 1998, the year 
following the job training program (which took place in 1997). The earnings variable is measured 
in thousands of dollars. The variable w = train is the binary participation (or "treatment") indicator. 
Participation in the job training program was partly based on past labor market outcomes and 
was partly voluntary. Therefore, random assignment is unlikely to be a good assumption. As control 
variables we use earnings in 1996 (the year prior to the program), years of schooling (educ), age, and 
marital status (married). Like the training indicator, marital status is coded as a binary variable, where 
married = 1 means the man is married. 

The simple regression estimates are 


\[ \widehat{earn98} = 10.61 - 2.05\, train \] [3.74] 
\(n = 1{,}130,\ R^2 = 0.016\) 


Because earn98 is measured in thousands of dollars, the coefficient on train, \(-2.05\), shows that, on 
average, those participating in the program earned $2,050 less than those who did not. The average 
earnings for those who did not participate is given by the intercept: $10,610. 
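
Because train is binary, the two group averages implied by (3.74) follow directly from the reported coefficients. The short sketch below only redoes that arithmetic; the variable names are chosen here for illustration.

```python
# Reported simple regression estimates from (3.74); earnings are in thousands of dollars.
intercept = 10.61
coef_train = -2.05

avg_nonparticipants = intercept             # predicted earnings when train = 0
avg_participants = intercept + coef_train   # predicted earnings when train = 1

print(f"nonparticipants: ${avg_nonparticipants * 1000:,.0f}")
print(f"participants:    ${avg_participants * 1000:,.0f}")
```

With a single binary regressor, these fitted values coincide with the sample averages of earn98 in the two groups, which is why the simple regression is a difference-in-means comparison.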

Without random assignment, it is possible, even likely, that the negative (and large in magnitude) 
coefficient on train is a product of nonrandom selection into participation. This could be either 
because men with poor earnings histories were more likely to be chosen or because such men are more 
likely to participate if made eligible. We will not examine these propositions in detail here. Instead, 
we add the four controls and perform a multiple regression: 


\[ \widehat{earn98} = 4.67 + 2.41\, train + .373\, earn96 + .363\, educ - .181\, age + 2.48\, married \] [3.75] 
\(n = 1{,}130,\ R^2 = 0.405\) 


The change in the coefficient on train is remarkable: the program is now estimated to increase 
earnings, on average, by $2,410. In other words, controlling for differences in preprogram earn- 
ings, education levels, age, and marital status produces a much different estimate than the simple 
difference-in-means estimate. 

The signs of the coefficients on the control variables are not surprising. We expect earnings to be 
positively correlated over time, so earn96 has a positive coefficient. Workers with more education 
also earn more: about $363 for each additional year. The marriage effect is roughly as large as the job 
training effect: ceteris paribus, married men earn, on average, about $2,480 more than their single 
counterparts. 

The predictive power of the explanatory variables is indicated by the R-squared in the multiple 
regression, \(R^2 = 0.405\). There is still much unexplained variation, but collectively the variables do a pretty 
good job. 


Before we end this section, a final remark is in order. As with the other examples in this chapter, we 
have not determined the statistical significance of the estimates. We remedy this omission in Chapter 4, 
where we learn how to test whether there is an effect in the entire population, and also obtain confidence 
intervals for the parameters, such as the average treatment effect of a job training program. 

GOING FURTHER 3.5 
Does it make sense to compare the intercepts in equations (3.74) and (3.75)? Explain. 


Summary 


1. The multiple regression model allows us to effectively hold other factors fixed while examining the 
effects of a particular independent variable on the dependent variable. It explicitly allows the indepen- 
dent variables to be correlated. 

2. Although the model is linear in its parameters, it can be used to model nonlinear relationships by 
appropriately choosing the dependent and independent variables. 

3. The method of ordinary least squares is easily applied to estimate the multiple regression model. Each 
slope estimate measures the partial effect of the corresponding independent variable on the dependent 
variable, holding all other independent variables fixed. 

4. \(R^2\) is the proportion of the sample variation in the dependent variable explained by the independent 
variables, and it serves as a goodness-of-fit measure. It is important not to put too much weight on the 
value of \(R^2\) when evaluating econometric models. 

5. Under the first four Gauss-Markov assumptions (MLR.1 through MLR.4), the OLS estimators are 
unbiased. This implies that including an irrelevant variable in a model has no effect on the unbiased- 
ness of the intercept and other slope estimators. On the other hand, omitting a relevant variable causes 
OLS to be biased. In many circumstances, the direction of the bias can be determined. 

6. Under the five Gauss-Markov assumptions, the variance of an OLS slope estimator is given by 
\(\mathrm{Var}(\hat{\beta}_j) = \sigma^2/[SST_j(1 - R_j^2)]\). As the error variance \(\sigma^2\) increases, so does \(\mathrm{Var}(\hat{\beta}_j)\), while \(\mathrm{Var}(\hat{\beta}_j)\) 
decreases as the sample variation in \(x_j\), \(SST_j\), increases. The term \(R_j^2\) measures the amount of collinearity 
between \(x_j\) and the other explanatory variables. As \(R_j^2\) approaches one, \(\mathrm{Var}(\hat{\beta}_j)\) is unbounded. 

7. Adding an irrelevant variable to an equation generally increases the variances of the remaining OLS 
estimators because of multicollinearity. 

8. Under the Gauss-Markov assumptions (MLR.1 through MLR.5), the OLS estimators are the best 
linear unbiased estimators (BLUEs). 

9. Section 3-7 discusses the various ways that multiple regression analysis is used in economics and 
other social sciences, including for prediction, testing efficient markets, estimating tradeoffs between 
variables, and evaluating policy interventions. We will see examples of all such applications in the 
remainder of the text. 

10. Beginning in Chapter 4, we will use the standard errors of the OLS coefficients to compute confidence 
intervals for the population parameters and to obtain test statistics for testing hypotheses about the 
population parameters. Therefore, in reporting regression results we now include the standard errors 
along with the associated OLS estimates. In equation form, standard errors are usually put in parenthe- 
ses below the OLS estimates, and the same convention is often used in tables of OLS output. 
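
The variance formula in item 6 of the summary can be checked numerically against the matrix expression for the OLS variance, \(\sigma^2 (X'X)^{-1}\). The sketch below uses simulated regressors; the sample size, the error variance `sigma2`, and the degree of collinearity between `x1` and `x2` are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
sigma2 = 4.0  # assumed error variance

x1 = rng.normal(size=n)
x2 = 0.9 * x1 + rng.normal(scale=0.5, size=n)  # deliberately collinear with x1
X = np.column_stack([np.ones(n), x1, x2])

# R_1^2 from regressing x1 on the other regressors (an intercept and x2).
Z = np.column_stack([np.ones(n), x2])
fitted = Z @ np.linalg.lstsq(Z, x1, rcond=None)[0]
sst_1 = np.sum((x1 - x1.mean()) ** 2)
r2_1 = 1.0 - np.sum((x1 - fitted) ** 2) / sst_1

# Variance of the OLS slope on x1, computed two ways.
var_formula = sigma2 / (sst_1 * (1.0 - r2_1))
var_matrix = sigma2 * np.linalg.inv(X.T @ X)[1, 1]
vif_1 = 1.0 / (1.0 - r2_1)

print("Var from sigma^2 / [SST_1 (1 - R_1^2)]:", var_formula)
print("Var from sigma^2 (X'X)^{-1}:           ", var_matrix)
print("variance inflation factor for x1:      ", round(vif_1, 2))
```

The two variance computations agree up to rounding error, and the variance inflation factor shows how much larger the variance is than it would be if `x1` were uncorrelated with `x2` in the sample.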


THE GAUSS-MARKOV ASSUMPTIONS 


The following is a summary of the five Gauss-Markov assumptions that we used in this chapter. Remember, 
the first four were used to establish unbiasedness of OLS, whereas the fifth was added to derive the usual 
variance formulas and to conclude that OLS is best linear unbiased. 


Assumption MLR.1 (Linear in Parameters) 
The model in the population can be written as 


\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + u, \]


where \(\beta_0, \beta_1, \ldots, \beta_k\) are the unknown parameters (constants) of interest and u is an unobserved random 
error or disturbance term. 


Assumption MLR.2 (Random Sampling) 
We have a random sample of n observations, \(\{(x_{i1}, x_{i2}, \ldots, x_{ik}, y_i): i = 1, 2, \ldots, n\}\), following the 
population model in Assumption MLR.1. 


Assumption MLR.3 (No Perfect Collinearity) 
In the sample (and therefore in the population), none of the independent variables is constant, and there are 
no exact linear relationships among the independent variables. 
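
A failure of Assumption MLR.3 shows up as a regressor matrix with less than full column rank. The sketch below is a hypothetical illustration: it includes the same income variable measured in dollars and in thousands of dollars, so one column is an exact linear function of another. The variable names and the simulated values are assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100

inc_dollars = rng.uniform(20_000, 80_000, size=n)
inc_thousands = inc_dollars / 1000.0   # exact linear function of inc_dollars
educ = rng.integers(8, 21, size=n).astype(float)

# Four columns, but two of them carry the same information.
X_bad = np.column_stack([np.ones(n), inc_dollars, inc_thousands, educ])
# Dropping the redundant column restores full column rank.
X_ok = np.column_stack([np.ones(n), inc_thousands, educ])

rank_bad = np.linalg.matrix_rank(X_bad)
rank_ok = np.linalg.matrix_rank(X_ok)

print("columns in X_bad:", X_bad.shape[1], "rank:", rank_bad)
print("columns in X_ok: ", X_ok.shape[1], "rank:", rank_ok)
```

When the rank is below the number of columns, the OLS first order conditions do not have a unique solution, which is why one of the redundant variables must be dropped.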


Assumption MLR.4 (Zero Conditional Mean) 
The error u has an expected value of zero given any values of the independent variables. In other words, 


\[ E(u \mid x_1, x_2, \ldots, x_k) = 0. \]


Assumption MLR.5 (Homoskedasticity) 
The error u has the same variance given any value of the explanatory variables. In other words, 


\[ \mathrm{Var}(u \mid x_1, \ldots, x_k) = \sigma^2. \]


Key Terms 


Best Linear Unbiased Estimator (BLUE) 
Biased Toward Zero 
Ceteris Paribus 
Conditional Independence 
Degrees of Freedom (df) 
Disturbance 
Downward Bias 
Endogenous Explanatory Variable 
Error Term 
Excluding a Relevant Variable 
Exogenous Explanatory Variables 
Explained Sum of Squares (SSE) 
First Order Conditions 
Frisch-Waugh Theorem 
Gauss-Markov Assumptions 
Gauss-Markov Theorem 
Ignorable Assignment 
Inclusion of an Irrelevant Variable 
Intercept 
Micronumerosity 
Misspecification Analysis 
Multicollinearity 
Multiple Linear Regression (MLR) Model 
Multiple Regression Analysis 
OLS Intercept Estimate 
OLS Regression Line 
OLS Slope Estimates 
Omitted Variable Bias 
Ordinary Least Squares 
Overspecifying the Model 
Partial Effect 
Perfect Collinearity 
Population Model 
Residual 
Residual Sum of Squares 
Sample Regression Function (SRF) 
Slope Parameters 
Standard Deviation of \(\hat{\beta}_j\) 
Standard Error of \(\hat{\beta}_j\) 
Standard Error of the Regression (SER) 
Sum of Squared Residuals (SSR) 
Total Sum of Squares (SST) 
True Model 
Unconfounded Assignment 
Underspecifying the Model 
Upward Bias 
Variance Inflation Factor (VIF) 


Problems 


1 Using the data in GPA2 on 4,137 college students, the following equation was estimated by OLS: 


\[ \widehat{colgpa} = 1.392 - .0135\, hsperc + .00148\, sat \]
\(n = 4{,}137,\ R^2 = .273\), 


where colgpa is measured on a four-point scale, hsperc is the percentile in the high school graduating 
class (defined so that, for example, hsperc = 5 means the top 5% of the class), and sat is the combined 
math and verbal scores on the student achievement test. 

(i) Why does it make sense for the coefficient on hsperc to be negative? 

(ii) What is the predicted college GPA when hsperc = 20 and sat = 1,050? 

(iii) Suppose that two high school graduates, A and B, graduated in the same percentile from high 
school, but Student A’s SAT score was 140 points higher (about one standard deviation in the sam- 
ple). What is the predicted difference in college GPA for these two students? Is the difference large? 

(iv) Holding hsperc fixed, what difference in SAT scores leads to a predicted colgpa difference of 
.50, or one-half of a grade point? Comment on your answer. 


2 The data in WAGE2 on working men were used to estimate the following equation: 


\[ \widehat{educ} = 10.36 - .094\, sibs + .131\, meduc + .210\, feduc \]
\(n = 722,\ R^2 = .214\), 


where educ is years of schooling, sibs is number of siblings, meduc is mother’s years of schooling, 
and feduc is father’s years of schooling. 

(i) Does sibs have the expected effect? Explain. Holding meduc and feduc fixed, by how much does 
sibs have to increase to reduce predicted years of education by one year? (A noninteger answer 
is acceptable here.) 

(ii) Discuss the interpretation of the coefficient on meduc. 

(iii) Suppose that Man A has no siblings, and his mother and father each have 12 years of education, 
and Man B has no siblings, and his mother and father each have 16 years of education. What is 
the predicted difference in years of education between B and A? 


3 The following model is a simplified version of the multiple regression model used by Biddle and 
Hamermesh (1990) to study the tradeoff between time spent sleeping and working and to look at 
other factors affecting sleep: 


\[ sleep = \beta_0 + \beta_1 totwrk + \beta_2 educ + \beta_3 age + u, \]


where sleep and totwrk (total work) are measured in minutes per week and educ and age are measured 
in years. (See also Computer Exercise C3 in Chapter 2.) 

(i) If adults trade off sleep for work, what is the sign of \(\beta_1\)? 

(ii) What signs do you think \(\beta_2\) and \(\beta_3\) will have? 

(iii) Using the data in SLEEP75, the estimated equation is 


\[ \widehat{sleep} = 3{,}638.25 - .148\, totwrk - 11.13\, educ + 2.20\, age \]
\(n = 706,\ R^2 = .113\). 


If someone works five more hours per week, by how many minutes is sleep predicted to fall? Is 
this a large tradeoff? 

(iv) Discuss the sign and magnitude of the estimated coefficient on educ. 

(v) Would you say totwrk, educ, and age explain much of the variation in sleep? What other factors 
might affect the time spent sleeping? Are these likely to be correlated with totwrk? 


4 The median starting salary for new law school graduates is determined by 


\[ \log(salary) = \beta_0 + \beta_1 LSAT + \beta_2 GPA + \beta_3 \log(libvol) + \beta_4 \log(cost) + \beta_5 rank + u, \]


where LSAT is the median LSAT score for the graduating class, GPA is the median college GPA for the 
class, libvol is the number of volumes in the law school library, cost is the annual cost of attending law 
school, and rank is a law school ranking (with rank = 1 being the best). 

(i) Explain why we expect \(\beta_5 \le 0\). 

(ii) What signs do you expect for the other slope parameters? Justify your answers. 

(iii) Using the data in LAWSCH85, the estimated equation is 


\[ \widehat{\log(salary)} = 8.34 + .0047\, LSAT + .248\, GPA + .095\, \log(libvol) + .038\, \log(cost) - .0033\, rank \]
\(n = 136,\ R^2 = .842\). 


What is the predicted ceteris paribus difference in salary for schools with a median GPA different 
by one point? (Report your answer as a percentage.) 

(iv) Interpret the coefficient on the variable log(libvol). 

(v) Would you say it is better to attend a higher ranked law school? How much is a difference in 
ranking of 20 worth in terms of predicted starting salary? 


5 In a study relating college grade point average to time spent in various activities, you distribute a survey 
to several students. The students are asked how many hours they spend each week in four activities: 
studying, sleeping, working, and leisure. Any activity is put into one of the four categories, so that 
for each student, the sum of hours in the four activities must be 168. 

(i) In the model 


\[ GPA = \beta_0 + \beta_1 study + \beta_2 sleep + \beta_3 work + \beta_4 leisure + u, \]


does it make sense to hold sleep, work, and leisure fixed, while changing study? 
(ii) Explain why this model violates Assumption MLR.3. 
(iii) How could you reformulate the model so that its parameters have a useful interpretation and it 
satisfies Assumption MLR.3? 


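6 Consider the multiple regression model containing three independent variables, under Assumptions 
MLR.1 through MLR.4: 


\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + u. \]


You are interested in estimating the sum of the parameters on \(x_1\) and \(x_2\); call this \(\theta_1 = \beta_1 + \beta_2\). 
(i) Show that \(\hat{\theta}_1 = \hat{\beta}_1 + \hat{\beta}_2\) is an unbiased estimator of \(\theta_1\). 
(ii) Find \(\mathrm{Var}(\hat{\theta}_1)\) in terms of \(\mathrm{Var}(\hat{\beta}_1)\), \(\mathrm{Var}(\hat{\beta}_2)\), and \(\mathrm{Corr}(\hat{\beta}_1, \hat{\beta}_2)\). 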


7 Which of the following can cause OLS estimators to be biased? 

(i) Heteroskedasticity. 

(ii) Omitting an important variable. 

(iii) A sample correlation coefficient of .95 between two independent variables both included in the 
model. 


8 Suppose that average worker productivity at manufacturing firms (avgprod) depends on two factors, 
average hours of training (avgtrain) and average worker ability (avgabil): 


\[ avgprod = \beta_0 + \beta_1 avgtrain + \beta_2 avgabil + u. \]


Assume that this equation satisfies the Gauss-Markov assumptions. If grants have been given to firms 
whose workers have less than average ability, so that avgtrain and avgabil are negatively correlated, 
what is the likely bias in \(\tilde{\beta}_1\) obtained from the simple regression of avgprod on avgtrain? 


9 The following equation describes the median housing price in a community in terms of amount of 
pollution (nox for nitrous oxide) and the average number of rooms in houses in the community (rooms): 


\[ \log(price) = \beta_0 + \beta_1 \log(nox) + \beta_2 rooms + u. \]


(i) What are the probable signs of \(\beta_1\) and \(\beta_2\)? What is the interpretation of \(\beta_1\)? Explain. 

(ii) Why might nox [or more precisely, log(nox)] and rooms be negatively correlated? If this is the 
case, does the simple regression of log(price) on log(nox) produce an upward or a downward 
biased estimator of \(\beta_1\)? 

(iii) Using the data in HPRICE2, the following equations were estimated: 


\[ \widehat{\log(price)} = 11.71 - 1.043 \log(nox),\quad n = 506,\ R^2 = .264. \]

\[ \widehat{\log(price)} = 9.23 - .718 \log(nox) + .306\, rooms,\quad n = 506,\ R^2 = .514. \]


Is the relationship between the simple and multiple regression estimates of the elasticity of price 
with respect to nox what you would have predicted, given your answer in part (ii)? Does this 
mean that \(-.718\) is definitely closer to the true elasticity than \(-1.043\)? 


10 Suppose that you are interested in estimating the ceteris paribus relationship between y and \(x_1\). For this 
purpose, you can collect data on two control variables, \(x_2\) and \(x_3\). (For concreteness, you might think 
of y as final exam score, \(x_1\) as class attendance, \(x_2\) as GPA up through the previous semester, and \(x_3\) as 
SAT or ACT score.) Let \(\tilde{\beta}_1\) be the simple regression estimate from y on \(x_1\) and let \(\hat{\beta}_1\) be the multiple 
regression estimate from y on \(x_1, x_2, x_3\). 
(i) If \(x_1\) is highly correlated with \(x_2\) and \(x_3\) in the sample, and \(x_2\) and \(x_3\) have large partial effects 
on y, would you expect \(\tilde{\beta}_1\) and \(\hat{\beta}_1\) to be similar or very different? Explain. 
(ii) If \(x_1\) is almost uncorrelated with \(x_2\) and \(x_3\), but \(x_2\) and \(x_3\) are highly correlated, will \(\tilde{\beta}_1\) and \(\hat{\beta}_1\) 
tend to be similar or very different? Explain. 
(iii) If \(x_1\) is highly correlated with \(x_2\) and \(x_3\), and \(x_2\) and \(x_3\) have small partial effects on y, would you 
expect \(se(\tilde{\beta}_1)\) or \(se(\hat{\beta}_1)\) to be smaller? Explain. 
(iv) If \(x_1\) is almost uncorrelated with \(x_2\) and \(x_3\), \(x_2\) and \(x_3\) have large partial effects on y, and \(x_2\) and \(x_3\) 
are highly correlated, would you expect \(se(\tilde{\beta}_1)\) or \(se(\hat{\beta}_1)\) to be smaller? Explain. 


11 Suppose that the population model determining y is 

\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + u, \]


and this model satisfies Assumptions MLR.1 through MLR.4. However, we estimate the model that 
omits \(x_3\). Let \(\tilde{\beta}_0\), \(\tilde{\beta}_1\), and \(\tilde{\beta}_2\) be the OLS estimators from the regression of y on \(x_1\) and \(x_2\). Show that the 
expected value of \(\tilde{\beta}_1\) (given the values of the independent variables in the sample) is 

\[ E(\tilde{\beta}_1) = \beta_1 + \beta_3 \frac{\sum_{i=1}^n \hat{r}_{i1} x_{i3}}{\sum_{i=1}^n \hat{r}_{i1}^2}, \]

where the \(\hat{r}_{i1}\) are the OLS residuals from the regression of \(x_1\) on \(x_2\). [Hint: The formula for \(\tilde{\beta}_1\) comes 
from equation (3.22). Plug \(y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \beta_3 x_{i3} + u_i\) into this equation. After some 
algebra, take the expectation treating \(x_{i3}\) and \(\hat{r}_{i1}\) as nonrandom.] 


12 The following equation represents the effects of tax revenue mix on subsequent employment growth 
for the population of counties in the United States: 


\[ growth = \beta_0 + \beta_1 share_P + \beta_2 share_I + \beta_3 share_S + \text{other factors}, \]


where growth is the percentage change in employment from 1980 to 1990, \(share_P\) is the share of property 
taxes in total tax revenue, \(share_I\) is the share of income tax revenues, and \(share_S\) is the share of 
sales tax revenues. All of these variables are measured in 1980. The omitted share, \(share_F\), includes 
fees and miscellaneous taxes. By definition, the four shares add up to one. Other factors would include 
expenditures on education, infrastructure, and so on (all measured in 1980). 

(i) Why must we omit one of the tax share variables from the equation? 

(ii) Give a careful interpretation of \(\beta_1\). 


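13 (i) Consider the simple regression model \(y = \beta_0 + \beta_1 x + u\) under the first four Gauss-Markov 
assumptions. For some function g(x), for example \(g(x) = x^2\) or \(g(x) = \log(1 + x^2)\), define 
\(z_i = g(x_i)\). Define a slope estimator as 


\[ \tilde{\beta}_1 = \left( \sum_{i=1}^n (z_i - \bar{z}) y_i \right) \Big/ \left( \sum_{i=1}^n (z_i - \bar{z}) x_i \right). \]


Show that \(\tilde{\beta}_1\) is linear and unbiased. Remember, because \(E(u \mid x) = 0\), you can treat both \(x_i\) and \(z_i\) 
as nonrandom in your derivation. 
(ii) Add the homoskedasticity assumption, MLR.5. Show that 


\[ \mathrm{Var}(\tilde{\beta}_1) = \sigma^2 \left( \sum_{i=1}^n (z_i - \bar{z})^2 \right) \Big/ \left( \sum_{i=1}^n (z_i - \bar{z}) x_i \right)^2. \]


(iii) Show directly that, under the Gauss-Markov assumptions, \(\mathrm{Var}(\hat{\beta}_1) \le \mathrm{Var}(\tilde{\beta}_1)\), where \(\hat{\beta}_1\) is the 
OLS estimator. [Hint: The Cauchy-Schwarz inequality in Math Refresher B implies that 


\[ \left( n^{-1} \sum_{i=1}^n (z_i - \bar{z})^2 \right) \left( n^{-1} \sum_{i=1}^n (x_i - \bar{x})^2 \right) \ge \left( n^{-1} \sum_{i=1}^n (z_i - \bar{z})(x_i - \bar{x}) \right)^2; \]


notice that we can drop \(\bar{x}\) from the sample covariance.] 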


14 Suppose you have a sample of size n on three variables, y, \(x_1\), and \(x_2\), and you are primarily interested 
in the effect of \(x_1\) on y. Let \(\tilde{\beta}_1\) be the coefficient on \(x_1\) from the simple regression and \(\hat{\beta}_1\) the coefficient 
on \(x_1\) from the regression y on \(x_1, x_2\). The standard errors reported by any regression package are 


\[ se(\tilde{\beta}_1) = \frac{\tilde{\sigma}}{\sqrt{SST_1}} \]
\[ se(\hat{\beta}_1) = \frac{\hat{\sigma}}{\sqrt{SST_1}} \cdot \sqrt{VIF_1}, \]

where \(\tilde{\sigma}\) is the SER from the simple regression, \(\hat{\sigma}\) is the SER from the multiple regression, 
\(VIF_1 = 1/(1 - R_1^2)\), and \(R_1^2\) is the R-squared from the regression of \(x_1\) on \(x_2\). Explain why 
\(se(\hat{\beta}_1)\) can be smaller or larger than \(se(\tilde{\beta}_1)\). 


15 The following estimated equations use the data in MLB1, which contains information on major league 
baseball salaries. The dependent variable, /salary, is the log of salary. The two explanatory variables 
are years in the major leagues (years) and runs batted in per year (rbisyr): 


\[ \widehat{lsalary} = \underset{(.098)}{12.373} + \underset{(.0132)}{.1770}\, years \]
\(n = 353,\ SSR = 326.196,\ SER = .964,\ R^2 = .337\) 


\[ \widehat{lsalary} = \underset{(.084)}{11.861} + \underset{(.0118)}{.0904}\, years + \underset{(.0020)}{.0302}\, rbisyr \]
\(n = 353,\ SSR = 198.475,\ SER = .753,\ R^2 = .597\) 


(i) How many degrees of freedom are in each regression? Why is the SER smaller in the second 
regression than the first? 

(ii) The sample correlation coefficient between years and rbisyr is about 0.487. Does this make 
sense? What is the variance inflation factor (there is only one) for the slope coefficients in the 
multiple regression? Would you say there is little, moderate, or strong collinearity between years 
and rbisyr? 

(iii) How come the standard error for the coefficient on years in the multiple regression is lower than 
its counterpart in the simple regression? 


16 The following equations were estimated using the data in LAWSCH85: 


\[ \widehat{lsalary} = \underset{(.24)}{9.90} - \underset{(.0003)}{.0041}\, rank + \underset{(.069)}{.294}\, GPA \]
\(n = 142,\ R^2 = .8238\) 

\[ \widehat{lsalary} = \underset{(.29)}{9.86} - \underset{(.0004)}{.0038}\, rank + \underset{(.083)}{.295}\, GPA + \underset{(.00036)}{.00017}\, age \]
\(n = 99,\ R^2 = .8036\) 

How can it be that the R-squared is smaller when the variable age is added to the equation? 


17 Consider an estimated equation for workers earning an hourly wage, wage, where educ, years of 
schooling, and exper, actual years in the workforce, are measured in years. The dependent variable is 
lwage = log(wage): 


\[ \widehat{lwage} = 0.532 + .094\, educ + .026\, exper \]
\(n = 932,\ R^2 = 0.188\) 


Suppose that getting one more year of education necessarily reduces workforce experience by one 
year. What is the estimated percentage change in wage from getting one more year of schooling? 


18 The potential outcomes framework in Section 3-7e can be extended to more than two potential out- 
comes. In fact, we can think of the policy variable, w, as taking on many different values, and then y(w) 
denotes the outcome for policy level w. For concreteness, suppose w is the dollar amount of a grant that 
can be used for purchasing books and electronics in college, and y(w) is a measure of college performance, 
such as grade point average. For example, y(0) is the resulting GPA if the student receives no grant 
and y(500) is the resulting GPA if the grant amount is $500. 

For a random draw i, we observe the grant level, \(w_i \ge 0\), and \(y_i = y_i(w_i)\). As in the binary program 
evaluation case, we observe the policy level, \(w_i\), and then only the outcome associated with that level. 
(i) Suppose a linear relationship is assumed: 


\[ y(w) = \alpha + \beta w + v(0), \]


where \(y(0) = \alpha + v\). Further, assume that for all i, \(w_i\) is independent of \(v_i\). Show that for each i 
we can write 


\[ y_i = \alpha + \beta w_i + v_i \]
\[ E(v_i \mid w_i) = 0. \]


(ii) In the setting of part (i), how would you estimate \(\beta\) (and \(\alpha\)) given a random sample? Justify 
your answer. 
(iii) Now suppose that \(w_i\) is possibly correlated with \(v_i\), but for a set of observed variables \(x_{ij}\), 


\[ E(v_i \mid w_i, x_{i1}, \ldots, x_{ik}) = E(v_i \mid x_{i1}, \ldots, x_{ik}) = \eta + \gamma_1 x_{i1} + \cdots + \gamma_k x_{ik}. \]


The first equality holds if \(w_i\) is independent of \(v_i\) conditional on \((x_{i1}, \ldots, x_{ik})\) and the second 
equality assumes a linear relationship. Show that we can write 


\[ y_i = \psi + \beta w_i + \gamma_1 x_{i1} + \cdots + \gamma_k x_{ik} + u_i \]
\[ E(u_i \mid w_i, x_{i1}, \ldots, x_{ik}) = 0. \]


What is the intercept \(\psi\)? 
(iv) How would you estimate \(\beta\) (along with \(\psi\) and the \(\gamma_j\)) in part (iii)? Explain. 


Computer Exercises 


C1 A problem of interest to health officials (and others) is to determine the effects of smoking during 
pregnancy on infant health. One measure of infant health is birth weight; a birth weight that is too 
low can put an infant at risk for contracting various illnesses. Since factors other than cigarette smoking 
that affect birth weight are likely to be correlated with smoking, we should take those factors into 
account. For example, higher income generally results in access to better prenatal care, as well as better 
nutrition for the mother. An equation that recognizes this is 


\[ bwght = \beta_0 + \beta_1 cigs + \beta_2 faminc + u. \]


(i) What is the most likely sign for \(\beta_2\)? 

(ii) Do you think cigs and faminc are likely to be correlated? Explain why the correlation might be 
positive or negative. 

(iii) Now, estimate the equation with and without faminc, using the data in BWGHT. Report the 
results in equation form, including the sample size and R-squared. Discuss your results, focusing 
on whether adding faminc substantially changes the estimated effect of cigs on bwght. 


C2 Use the data in HPRICE1 to estimate the model 

\[ price = \beta_0 + \beta_1 sqrft + \beta_2 bdrms + u, \]


where price is the house price measured in thousands of dollars. 

(i) Write out the results in equation form. 

(ii) | What is the estimated increase in price for a house with one more bedroom, holding square 
footage constant? 

(iii) What is the estimated increase in price for a house with an additional bedroom that is 
140 square feet in size? Compare this to your answer in part (ii). 
(iv) What percentage of the variation in price is explained by square footage and number 
of bedrooms? 

(v) The first house in the sample has sqrft = 2,438 and bdrms = 4. Find the predicted selling price 
for this house from the OLS regression line. 

(vi) The actual selling price of the first house in the sample was \$300,000 (so price = 300). Find 
the residual for this house. Does it suggest that the buyer underpaid or overpaid for the house? 


C3 The file CEOSAL2 contains data on 177 chief executive officers and can be used to examine 
the effects of firm performance on CEO salary. 

(i) Estimate a model relating annual salary to firm sales and market value. Make the model 
of the constant elasticity variety for both independent variables. Write the results out 
in equation form. 

(ii) Add profits to the model from part (i). Why can this variable not be included in logarithmic 
form? Would you say that these firm performance variables explain most of the variation 
in CEO salaries? 

(iii) Add the variable ceoten to the model in part (ii). What is the estimated percentage return for 
another year of CEO tenure, holding other factors fixed? 

(iv) Find the sample correlation coefficient between the variables log(mktval) and profits. Are these 
variables highly correlated? What does this say about the OLS estimators? 


C4 Use the data in ATTEND for this exercise. 

(i) Obtain the minimum, maximum, and average values for the variables atndrte, priGPA, and ACT. 

(ii) Estimate the model 

$$\mathit{atndrte} = \beta_0 + \beta_1 \mathit{priGPA} + \beta_2 \mathit{ACT} + u,$$

and write the results in equation form. Interpret the intercept. Does it have a useful meaning? 

(iii) Discuss the estimated slope coefficients. Are there any surprises? 

(iv) What is the predicted atndrte if priGPA = 3.65 and ACT = 20? What do you make of this 
result? Are there any students in the sample with these values of the explanatory variables? 

(v) If Student A has priGPA = 3.1 and ACT = 21 and Student B has priGPA = 2.1 and 
ACT = 26, what is the predicted difference in their attendance rates? 


C5 Confirm the partialling out interpretation of the OLS estimates by explicitly doing the partialling out 
for Example 3.2. This first requires regressing educ on exper and tenure and saving the residuals, $\hat{r}_1$. 
Then, regress log(wage) on $\hat{r}_1$. Compare the coefficient on $\hat{r}_1$ with the coefficient on educ in the 
regression of log(wage) on educ, exper, and tenure. 


C6 Use the data set in WAGE2 for this problem. As usual, be sure all of the following regressions contain 
an intercept. 

(i) Run a simple regression of IQ on educ to obtain the slope coefficient, say, $\tilde{\delta}_1$. 

(ii) Run the simple regression of log(wage) on educ, and obtain the slope coefficient, $\tilde{\beta}_1$. 

(iii) Run the multiple regression of log(wage) on educ and IQ, and obtain the slope coefficients, 
$\hat{\beta}_1$ and $\hat{\beta}_2$, respectively. 

(iv) Verify that $\tilde{\beta}_1 = \hat{\beta}_1 + \hat{\beta}_2 \tilde{\delta}_1$. 


C7 Use the data in MEAP93 to answer this question. 

(i) Estimate the model 

$$\mathit{math10} = \beta_0 + \beta_1 \log(\mathit{expend}) + \beta_2 \mathit{lnchprg} + u,$$

and report the results in the usual form, including the sample size and R-squared. Are the signs 
of the slope coefficients what you expected? Explain. 

(ii) What do you make of the intercept you estimated in part (i)? In particular, does it make sense to 
set the two explanatory variables to zero? [Hint: Recall that log(1) = 0.] 

(iii) Now run the simple regression of math10 on log(expend), and compare the slope coefficient 
with the estimate obtained in part (i). Is the estimated spending effect now larger or smaller 
than in part (i)? 

(iv) Find the correlation between lexpend = log(expend) and lnchprg. Does its sign make sense to you? 

(v) Use part (iv) to explain your findings in part (iii). 


C8 Use the data in DISCRIM to answer this question. These are ZIP code-level data on prices for various 
items at fast-food restaurants, along with characteristics of the zip code population, in New Jersey 
and Pennsylvania. The idea is to see whether fast-food restaurants charge higher prices in areas with a 
larger concentration of blacks. 

(i) Find the average values of prpblck and income in the sample, along with their standard deviations. 
What are the units of measurement of prpblck and income? 

(ii) Consider a model to explain the price of soda, psoda, in terms of the proportion of the population 
that is black and median income: 

$$\mathit{psoda} = \beta_0 + \beta_1 \mathit{prpblck} + \beta_2 \mathit{income} + u.$$

Estimate this model by OLS and report the results in equation form, including the sample size 
and R-squared. (Do not use scientific notation when reporting the estimates.) Interpret the 
coefficient on prpblck. Do you think it is economically large? 

(iii) Compare the estimate from part (ii) with the simple regression estimate from psoda on prpblck. 
Is the discrimination effect larger or smaller when you control for income? 

(iv) A model with a constant price elasticity with respect to income may be more appropriate. 
Report estimates of the model 

$$\log(\mathit{psoda}) = \beta_0 + \beta_1 \mathit{prpblck} + \beta_2 \log(\mathit{income}) + u.$$

If prpblck increases by .20 (20 percentage points), what is the estimated percentage change 
in psoda? (Hint: The answer is 2.xx, where you fill in the “xx.”) 

(v) Now add the variable prppov to the regression in part (iv). What happens to $\hat{\beta}_{\mathit{prpblck}}$? 

(vi) Find the correlation between log(income) and prppov. Is it roughly what you expected? 

(vii) Evaluate the following statement: “Because log(income) and prppov are so highly correlated, 
they have no business being in the same regression.” 


C9 Use the data in CHARITY to answer the following questions: 

(i) Estimate the equation 

$$\mathit{gift} = \beta_0 + \beta_1 \mathit{mailsyear} + \beta_2 \mathit{giftlast} + \beta_3 \mathit{propresp} + u$$

by OLS and report the results in the usual way, including the sample size and R-squared. 
How does the R-squared compare with that from the simple regression that omits giftlast 
and propresp? 

(ii) Interpret the coefficient on mailsyear. Is it bigger or smaller than the corresponding simple 
regression coefficient? 

(iii) Interpret the coefficient on propresp. Be careful to notice the units of measurement of propresp. 

(iv) Now add the variable avggift to the equation. What happens to the estimated effect 
of mailsyear? 

(v) In the equation from part (iv), what has happened to the coefficient on giftlast? What do you 
think is happening? 


C10 Use the data in HTV to answer this question. The data set includes information on wages, education, 
parents’ education, and several other variables for 1,230 working men in 1991. 

(i) What is the range of the educ variable in the sample? What percentage of men completed 
twelfth grade but no higher grade? Do the men or their parents have, on average, higher levels 
of education? 
(ii) Estimate the regression model 

$$\mathit{educ} = \beta_0 + \beta_1 \mathit{motheduc} + \beta_2 \mathit{fatheduc} + u$$

by OLS and report the results in the usual form. How much sample variation in educ is 
explained by parents’ education? Interpret the coefficient on motheduc. 

(iii) Add the variable abil (a measure of cognitive ability) to the regression from part (ii), and report 
the results in equation form. Does “ability” help to explain variations in education, even after 
controlling for parents’ education? Explain. 

(iv) (Requires calculus) Now estimate an equation where abil appears in quadratic form: 

$$\mathit{educ} = \beta_0 + \beta_1 \mathit{motheduc} + \beta_2 \mathit{fatheduc} + \beta_3 \mathit{abil} + \beta_4 \mathit{abil}^2 + u.$$

Using the estimates $\hat{\beta}_3$ and $\hat{\beta}_4$, use calculus to find the value of abil, call it $\mathit{abil}^*$, where educ 
is minimized. (The other coefficients and values of parents’ education variables have no effect; 
we are holding parents’ education fixed.) Notice that abil is measured so that negative values 
are permissible. You might also verify that the second derivative is positive so that you do 
indeed have a minimum. 

(v) Argue that only a small fraction of men in the sample have “ability” less than the value 
calculated in part (iv). Why is this important? 

(vi) If you have access to a statistical program that includes graphing capabilities, use the estimates 
in part (iv) to graph the relationship between the predicted education and abil. Set motheduc and 
fatheduc at their average values in the sample, 12.18 and 12.45, respectively. 


C11 Use the data in MEAPSINGLE to study the effects of single-parent households on student math 
performance. These data are for a subset of schools in southeast Michigan for the year 2000. The 
socioeconomic variables are obtained at the ZIP code level (where ZIP code is assigned to schools based on 
their mailing addresses). 

(i) Run the simple regression of math4 on pctsgle and report the results in the usual format. 
Interpret the slope coefficient. Does the effect of single parenthood seem large or small? 

(ii) Add the variables lmedinc and free to the equation. What happens to the coefficient on pctsgle? 
Explain what is happening. 

(iii) Find the sample correlation between lmedinc and free. Does it have the sign you expect? 

(iv) Does the substantial correlation between lmedinc and free mean that you should drop one from 
the regression to better estimate the causal effect of single parenthood on student performance? 
Explain. 

(v) Find the variance inflation factors (VIFs) for each of the explanatory variables appearing 
in the regression in part (ii). Which variable has the largest VIF? Does this knowledge 
affect the model you would use to study the causal effect of single parenthood on math 
performance? 


C12 The data in ECONMATH contain grade point averages and standardized test scores, along with 
performance in an introductory economics course, for students at a large public university. The 
variable to be explained is score, the final score in the course measured as a percentage. 

(i) How many students received a perfect score for the course? What was the average score? Find 
the means and standard deviations of actmth and acteng, and discuss how they compare. 

(ii) Estimate a linear equation relating score to colgpa, actmth, and acteng, where colgpa is 
measured at the beginning of the term. Report the results in the usual form. 

(iii) Would you say the math or English ACT score is a better predictor of performance in the 
economics course? Explain. 

(iv) Discuss the size of the R-squared in the regression. 
C13 Use the data in GPA1 to answer this question. We can compare multiple regression estimates, where 
we control for student achievement and background variables, and compare our findings with the 
difference-in-means estimate in Computer Exercise C11 in Chapter 2. 

(i) In the simple regression equation 

$$\mathit{colGPA} = \beta_0 + \beta_1 \mathit{PC} + u$$

obtain $\hat{\beta}_0$ and $\hat{\beta}_1$. Interpret these estimates. 

(ii) Now add the controls hsGPA and ACT; that is, run the regression colGPA on PC, hsGPA, and 
ACT. Does the coefficient on PC change much from part (i)? Does $\hat{\beta}_{\mathit{hsGPA}}$ make sense? 

(iii) In the estimation from part (ii), what is worth more: Owning a PC or having 10 more points on 
the ACT score? 

(iv) Now to the regression in part (ii) add the two binary indicators for the parents being college 
graduates. Does the estimate of $\beta_1$ change much from part (ii)? How much variation are you 
explaining in colGPA? 

(v) Suppose someone looking at your regression from part (iv) says to you, “The variables hsGPA 
and ACT are probably pretty highly correlated, so you should drop one of them from the 
regression.” How would you respond? 


APPENDIX 3A 


3A.1 Derivation of the First Order Conditions in Equation (3.13) 


The analysis is very similar to the simple regression case. We must characterize the solutions to 
the problem 


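$$\min_{b_0, b_1, \ldots, b_k} \sum_{i=1}^{n} (y_i - b_0 - b_1 x_{i1} - \cdots - b_k x_{ik})^2.$$


Taking the partial derivatives with respect to each of the $b_j$ (see Math Refresher A), evaluating them 
at the solutions, and setting them equal to zero gives 


$$-2 \sum_{i=1}^{n} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \cdots - \hat{\beta}_k x_{ik}) = 0$$

$$-2 \sum_{i=1}^{n} x_{ij} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \cdots - \hat{\beta}_k x_{ik}) = 0, \quad \text{for all } j = 1, \ldots, k.$$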


Canceling the —2 gives the first order conditions in (3.13). 
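
As a numerical illustration of these conditions, the following Python sketch fits OLS on simulated data and checks that each sum in (3.13) is zero up to rounding error. The simulated data, the seed, and the helper `ols` (which solves the normal equations directly) are assumptions made for this illustration and do not come from the text.

```python
import random

def ols(X, y):
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(r[a] * r[b] for r in X) for b in range(k)]
         + [sum(r[a] * yi for r, yi in zip(X, y))] for a in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))  # partial pivoting
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

rng = random.Random(0)
n = 200
# Hypothetical data: an intercept column plus two correlated regressors.
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [0.5 * a + rng.gauss(0, 1) for a in x1]
y = [1.0 + 2.0 * a - 1.0 * b + rng.gauss(0, 1) for a, b in zip(x1, x2)]
X = [[1.0, a, b] for a, b in zip(x1, x2)]

beta_hat = ols(X, y)
resid = [yi - sum(b * xij for b, xij in zip(beta_hat, row)) for row, yi in zip(X, y)]

# First order conditions: sum_i x_ij * residual_i = 0 for each column j.
# The column of ones (j = 0) gives the condition for the intercept.
foc = [sum(row[j] * u for row, u in zip(X, resid)) for j in range(3)]
print(foc)
```

Each entry of `foc` should be zero apart from floating-point error, which is just the statement that the OLS residuals sum to zero and are orthogonal to every regressor in the sample.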


3A.2 Derivation of Equation (3.22) 


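To derive (3.22), write $x_{i1}$ in terms of its fitted value and its residual from the regression of $x_1$ on 
$x_2, \ldots, x_k$: $x_{i1} = \hat{x}_{i1} + \hat{r}_{i1}$, for all $i = 1, \ldots, n$. Now, plug this into the second equation in (3.13): 


$$\sum_{i=1}^{n} (\hat{x}_{i1} + \hat{r}_{i1})(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \cdots - \hat{\beta}_k x_{ik}) = 0.$$ [3.76] 


By the definition of the OLS residual $\hat{u}_i$, because $\hat{x}_{i1}$ is just a linear function of the explanatory 
variables $x_{i2}, \ldots, x_{ik}$, it follows that $\sum_{i=1}^{n} \hat{x}_{i1} \hat{u}_i = 0$. Therefore, equation (3.76) can be expressed as 


$$\sum_{i=1}^{n} \hat{r}_{i1} (y_i - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \cdots - \hat{\beta}_k x_{ik}) = 0.$$ [3.77] 


Because the $\hat{r}_{i1}$ are the residuals from regressing $x_1$ on $x_2, \ldots, x_k$, $\sum_{i=1}^{n} x_{ij} \hat{r}_{i1} = 0$, for all 
$j = 2, \ldots, k$. Therefore, (3.77) is equivalent to $\sum_{i=1}^{n} \hat{r}_{i1} (y_i - \hat{\beta}_1 x_{i1}) = 0$. Finally, we use the fact that 
$\sum_{i=1}^{n} \hat{x}_{i1} \hat{r}_{i1} = 0$, which means that $\hat{\beta}_1$ solves 


$$\sum_{i=1}^{n} \hat{r}_{i1} (y_i - \hat{\beta}_1 \hat{r}_{i1}) = 0.$$


Now, straightforward algebra gives (3.22), provided, of course, that $\sum_{i=1}^{n} \hat{r}_{i1}^2 > 0$; this is ensured by 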
Assumption MLR.3. 
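
The partialling-out result can be checked numerically. The sketch below, written under the assumption of one intercept and two simulated regressors, computes the slope on $x_1$ from the full regression and again as $\sum \hat{r}_{i1} y_i / \sum \hat{r}_{i1}^2$, where the $\hat{r}_{i1}$ come from regressing $x_1$ on $x_2$. The data, the seed, and the helper `ols` are hypothetical choices for the illustration.

```python
import random

def ols(X, y):
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination."""
    k = len(X[0])
    A = [[sum(r[a] * r[b] for r in X) for b in range(k)]
         + [sum(r[a] * yi for r, yi in zip(X, y))] for a in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))  # partial pivoting
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

rng = random.Random(1)
n = 150
x2 = [rng.gauss(0, 1) for _ in range(n)]
x1 = [0.7 * b + rng.gauss(0, 1) for b in x2]  # x1 is correlated with x2
y = [2.0 + 1.5 * a + 0.5 * b + rng.gauss(0, 1) for a, b in zip(x1, x2)]

# Full regression of y on (1, x1, x2).
beta_hat = ols([[1.0, a, b] for a, b in zip(x1, x2)], y)

# Step 1: regress x1 on (1, x2) and keep the residuals r1.
g = ols([[1.0, b] for b in x2], x1)
r1 = [a - (g[0] + g[1] * b) for a, b in zip(x1, x2)]

# Step 2: slope from the residual formula, sum(r1 * y) / sum(r1^2).
beta1_partial = sum(r * yi for r, yi in zip(r1, y)) / sum(r * r for r in r1)

print(beta_hat[1], beta1_partial)
```

The two printed numbers should agree to rounding error, which is the sample counterpart of the algebra in this section.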


3A.3 Proof of Theorem 3.1 


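We prove Theorem 3.1 for $\hat{\beta}_1$; the proof for the other slope parameters is virtually identical. (See 
Advanced Treatment E for a more succinct proof using matrices.) Under Assumption MLR.3, the OLS 
estimators exist, and we can write $\hat{\beta}_1$ as in (3.22). Under Assumption MLR.1, we can write $y_i$ as in 
(3.32); substitute this for $y_i$ in (3.22). Then, using $\sum_{i=1}^{n} \hat{r}_{i1} = 0$, $\sum_{i=1}^{n} x_{ij} \hat{r}_{i1} = 0$, for all $j = 2, \ldots, k$, 
and $\sum_{i=1}^{n} x_{i1} \hat{r}_{i1} = \sum_{i=1}^{n} \hat{r}_{i1}^2$, we have 


$$\hat{\beta}_1 = \beta_1 + \left( \sum_{i=1}^{n} \hat{r}_{i1} u_i \right) \Big/ \left( \sum_{i=1}^{n} \hat{r}_{i1}^2 \right).$$ [3.78] 


Now, under Assumptions MLR.2 and MLR.4, the expected value of each $u_i$, given all independent 
variables in the sample, is zero. Because the $\hat{r}_{i1}$ are just functions of the sample independent 
variables, it follows that 


$$E(\hat{\beta}_1 \mid X) = \beta_1 + \left( \sum_{i=1}^{n} \hat{r}_{i1} E(u_i \mid X) \right) \Big/ \left( \sum_{i=1}^{n} \hat{r}_{i1}^2 \right)$$

$$= \beta_1 + \left( \sum_{i=1}^{n} \hat{r}_{i1} \cdot 0 \right) \Big/ \left( \sum_{i=1}^{n} \hat{r}_{i1}^2 \right) = \beta_1,$$


where $X$ denotes the data on all independent variables and $E(\hat{\beta}_1 \mid X)$ is the expected value of $\hat{\beta}_1$ given 
$x_{i1}, \ldots, x_{ik}$, for all $i = 1, \ldots, n$. This completes the proof. 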


3A.4 General Omitted Variable Bias 


We can derive the omitted variable bias in the general model in equation (3.31) under the first four 
Gauss-Markov assumptions. In particular, let the $\hat{\beta}_j$, $j = 0, 1, \ldots, k$ be the OLS estimators from the 
regression using the full set of explanatory variables. Let the $\tilde{\beta}_j$, $j = 0, 1, \ldots, k - 1$ be the OLS 
estimators from the regression that leaves out $x_k$. Let $\tilde{\delta}_j$, $j = 1, \ldots, k - 1$ be the slope coefficient 
on $x_j$ in the auxiliary regression of $x_{ik}$ on $x_{i1}, x_{i2}, \ldots, x_{i,k-1}$, $i = 1, \ldots, n$. A useful fact is that 

$$\tilde{\beta}_j = \hat{\beta}_j + \hat{\beta}_k \tilde{\delta}_j.$$ [3.79] 


This shows explicitly that, when we do not control for $x_k$ in the regression, the estimated partial 
effect of $x_j$ equals the partial effect when we include $x_k$ plus the partial effect of $x_k$ on $\hat{y}$ times the 
partial relationship between the omitted variable, $x_k$, and $x_j$, $j < k$. Conditional on the entire set of 
explanatory variables, $X$, we know that the $\hat{\beta}_j$ are all unbiased for the corresponding $\beta_j$, $j = 1, \ldots, k$. 
Further, because $\tilde{\delta}_j$ is just a function of $X$, we have 


$$E(\tilde{\beta}_j \mid X) = E(\hat{\beta}_j \mid X) + E(\hat{\beta}_k \mid X) \tilde{\delta}_j = \beta_j + \beta_k \tilde{\delta}_j.$$ [3.80] 


Equation (3.80) shows that $\tilde{\beta}_j$ is biased for $\beta_j$ unless $\beta_k = 0$ (in which case $x_k$ has no partial effect 
in the population) or $\tilde{\delta}_j$ equals zero, which means that $x_{ik}$ and $x_{ij}$ are partially uncorrelated in the 
sample. The key to obtaining equation (3.80) is equation (3.79). To show equation (3.79), we can use 
equation (3.22) a couple of times. For simplicity, we look at $j = 1$. Now, $\tilde{\beta}_1$ is the slope coefficient 
in the simple regression of $y_i$ on $\tilde{r}_{i1}$, $i = 1, \ldots, n$, where the $\tilde{r}_{i1}$ are the OLS residuals from the 
regression of $x_{i1}$ on $x_{i2}, x_{i3}, \ldots, x_{i,k-1}$. Consider the numerator of the expression for $\tilde{\beta}_1$: $\sum_{i=1}^{n} \tilde{r}_{i1} y_i$. 
But for each $i$, we can write $y_i = \hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \cdots + \hat{\beta}_k x_{ik} + \hat{u}_i$ and plug in for $y_i$. Now, by 
properties of the OLS residuals, the $\tilde{r}_{i1}$ have zero sample average and are uncorrelated with 
$x_{i2}, x_{i3}, \ldots, x_{i,k-1}$ in the sample. Similarly, the $\hat{u}_i$ have zero sample average and zero sample 
correlation with $x_{i1}, x_{i2}, \ldots, x_{ik}$. It follows that the $\tilde{r}_{i1}$ and $\hat{u}_i$ are uncorrelated in the sample (because the $\tilde{r}_{i1}$ 
are just linear combinations of $x_{i1}, x_{i2}, \ldots, x_{i,k-1}$). So 


$$\sum_{i=1}^{n} \tilde{r}_{i1} y_i = \hat{\beta}_1 \left( \sum_{i=1}^{n} \tilde{r}_{i1} x_{i1} \right) + \hat{\beta}_k \left( \sum_{i=1}^{n} \tilde{r}_{i1} x_{ik} \right).$$ [3.81] 

Now, $\sum_{i=1}^{n} \tilde{r}_{i1} x_{i1} = \sum_{i=1}^{n} \tilde{r}_{i1}^2$, which is also the denominator of $\tilde{\beta}_1$. Therefore, we have shown that 

$$\tilde{\beta}_1 = \hat{\beta}_1 + \hat{\beta}_k \left( \sum_{i=1}^{n} \tilde{r}_{i1} x_{ik} \right) \Big/ \left( \sum_{i=1}^{n} \tilde{r}_{i1}^2 \right)$$

$$= \hat{\beta}_1 + \hat{\beta}_k \tilde{\delta}_1.$$


This is the relationship we wanted to show. 
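
The identity in (3.79) holds exactly in any sample, so it can be confirmed with a few lines of code. The sketch below uses the simplest case, $k = 2$, with simulated data: it compares the short-regression slope on $x_1$ with the long-regression slope plus the coefficient on the omitted $x_2$ times the slope from regressing $x_2$ on $x_1$. The data-generating values, the seed, and the helper `ols` are assumptions for the illustration.

```python
import random

def ols(X, y):
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination."""
    k = len(X[0])
    A = [[sum(r[a] * r[b] for r in X) for b in range(k)]
         + [sum(r[a] * yi for r, yi in zip(X, y))] for a in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))  # partial pivoting
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    return [A[i][k] / A[i][i] for i in range(k)]

rng = random.Random(2)
n = 300
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [0.6 * a + rng.gauss(0, 1) for a in x1]  # omitted variable, correlated with x1
y = [1.0 + 0.5 * a + 2.0 * b + rng.gauss(0, 1) for a, b in zip(x1, x2)]

beta_hat = ols([[1.0, a, b] for a, b in zip(x1, x2)], y)   # long regression
beta_tilde = ols([[1.0, a] for a in x1], y)                # short regression, x2 left out
delta_tilde = ols([[1.0, a] for a in x1], x2)              # auxiliary regression of x2 on x1

lhs = beta_tilde[1]
rhs = beta_hat[1] + beta_hat[2] * delta_tilde[1]
print(lhs, rhs)
```

Because the population coefficient on $x_2$ and the sample relationship between $x_2$ and $x_1$ are both positive in this design, the short-regression slope is larger than the long-regression slope, in line with (3.80).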


3A.5 Proof of Theorem 3.2 


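Again, we prove this for $j = 1$. Write $\hat{\beta}_1$ as in equation (3.78). Now, under MLR.5, $\mathrm{Var}(u_i \mid X) = \sigma^2$, 
for all $i = 1, \ldots, n$. Under random sampling, the $u_i$ are independent, even conditional on $X$, and the 
$\hat{r}_{i1}$ are nonrandom conditional on $X$. Therefore, 


$$\mathrm{Var}(\hat{\beta}_1 \mid X) = \left( \sum_{i=1}^{n} \hat{r}_{i1}^2 \mathrm{Var}(u_i \mid X) \right) \Big/ \left( \sum_{i=1}^{n} \hat{r}_{i1}^2 \right)^2$$

$$= \left( \sum_{i=1}^{n} \hat{r}_{i1}^2 \sigma^2 \right) \Big/ \left( \sum_{i=1}^{n} \hat{r}_{i1}^2 \right)^2 = \sigma^2 \Big/ \left( \sum_{i=1}^{n} \hat{r}_{i1}^2 \right).$$


Now, because $\sum_{i=1}^{n} \hat{r}_{i1}^2$ is the sum of squared residuals from regressing $x_1$ on $x_2, \ldots, x_k$, 
$\sum_{i=1}^{n} \hat{r}_{i1}^2 = \mathrm{SST}_1 (1 - R_1^2)$. This completes the proof. 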


3A.6 Proof of Theorem 3.4 


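We show that, for any other linear unbiased estimator $\tilde{\beta}_1$ of $\beta_1$, $\mathrm{Var}(\tilde{\beta}_1) \ge \mathrm{Var}(\hat{\beta}_1)$, where $\hat{\beta}_1$ is the 
OLS estimator. The focus on $j = 1$ is without loss of generality. 
For $\tilde{\beta}_1$ as in equation (3.60), we can plug in for $y_i$ to obtain 


$$\tilde{\beta}_1 = \beta_0 \sum_{i=1}^{n} w_{i1} + \beta_1 \sum_{i=1}^{n} w_{i1} x_{i1} + \beta_2 \sum_{i=1}^{n} w_{i1} x_{i2} + \cdots + \beta_k \sum_{i=1}^{n} w_{i1} x_{ik} + \sum_{i=1}^{n} w_{i1} u_i.$$

Now, because the $w_{i1}$ are functions of the $x_{ij}$, 

$$E(\tilde{\beta}_1 \mid X) = \beta_0 \sum_{i=1}^{n} w_{i1} + \beta_1 \sum_{i=1}^{n} w_{i1} x_{i1} + \beta_2 \sum_{i=1}^{n} w_{i1} x_{i2} + \cdots + \beta_k \sum_{i=1}^{n} w_{i1} x_{ik} + \sum_{i=1}^{n} w_{i1} E(u_i \mid X)$$

$$= \beta_0 \sum_{i=1}^{n} w_{i1} + \beta_1 \sum_{i=1}^{n} w_{i1} x_{i1} + \beta_2 \sum_{i=1}^{n} w_{i1} x_{i2} + \cdots + \beta_k \sum_{i=1}^{n} w_{i1} x_{ik}$$

because $E(u_i \mid X) = 0$, for all $i = 1, \ldots, n$ under MLR.2 and MLR.4. Therefore, for $E(\tilde{\beta}_1 \mid X)$ to 
equal $\beta_1$ for any values of the parameters, we must have 


$$\sum_{i=1}^{n} w_{i1} = 0, \quad \sum_{i=1}^{n} w_{i1} x_{i1} = 1, \quad \sum_{i=1}^{n} w_{i1} x_{ij} = 0, \; j = 2, \ldots, k.$$ [3.82] 

Now, let $\hat{r}_{i1}$ be the residuals from the regression of $x_{i1}$ on $x_{i2}, \ldots, x_{ik}$. Then, from (3.82), it follows that 

$$\sum_{i=1}^{n} w_{i1} \hat{r}_{i1} = 1$$ [3.83] 


because $x_{i1} = \hat{x}_{i1} + \hat{r}_{i1}$ and $\sum_{i=1}^{n} w_{i1} \hat{x}_{i1} = 0$. Now, consider the difference between $\mathrm{Var}(\tilde{\beta}_1 \mid X)$ and 
$\mathrm{Var}(\hat{\beta}_1 \mid X)$ under MLR.1 through MLR.5: 


$$\sigma^2 \sum_{i=1}^{n} w_{i1}^2 - \sigma^2 \Big/ \left( \sum_{i=1}^{n} \hat{r}_{i1}^2 \right).$$ [3.84] 


Because of (3.83), we can write the difference in (3.84), without $\sigma^2$, as 


$$\sum_{i=1}^{n} w_{i1}^2 - \left( \sum_{i=1}^{n} w_{i1} \hat{r}_{i1} \right)^2 \Big/ \left( \sum_{i=1}^{n} \hat{r}_{i1}^2 \right).$$ [3.85] 


But (3.85) is simply 


$$\sum_{i=1}^{n} (w_{i1} - \hat{\gamma}_1 \hat{r}_{i1})^2,$$ [3.86] 


where $\hat{\gamma}_1 = \left( \sum_{i=1}^{n} w_{i1} \hat{r}_{i1} \right) / \left( \sum_{i=1}^{n} \hat{r}_{i1}^2 \right)$, as can be seen by squaring each term in (3.86), summing, and 
then canceling terms. Because (3.86) is just the sum of squared residuals from the simple regression 
of $w_{i1}$ on $\hat{r}_{i1}$ (remember that the sample average of $\hat{r}_{i1}$ is zero), (3.86) must be nonnegative. This 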
completes the proof. 


CHAPTER 4 


Multiple Regression 
Analysis: Inference 


This chapter continues our treatment of multiple regression analysis. We now turn to the problem 
of testing hypotheses about the parameters in the population regression model. We begin in 
Section 4-1 by finding the distributions of the OLS estimators under the added assumption that 
the population error is normally distributed. Sections 4-2 and 4-3 cover hypothesis testing about 
individual parameters, while Section 4-4 discusses how to test a single hypothesis involving more than 
one parameter. We focus on testing multiple restrictions in Section 4-5 and pay particular attention to 
determining whether a group of independent variables can be omitted from a model. 


4-1 Sampling Distributions of the OLS Estimators 


Up to this point, we have formed a set of assumptions under which OLS is unbiased; we have also 
derived and discussed the bias caused by omitted variables. In Section 3-4, we obtained the variances 
of the OLS estimators under the Gauss-Markov assumptions. In Section 3-5, we showed that this vari- 
ance is smallest among linear unbiased estimators. 

Knowing the expected value and variance of the OLS estimators is useful for describing the 
precision of the OLS estimators. However, in order to perform statistical inference, we need to know 
more than just the first two moments of $\hat{\beta}_j$; we need to know the full sampling distribution of the $\hat{\beta}_j$. 
Even under the Gauss-Markov assumptions, the distribution of $\hat{\beta}_j$ can have virtually any shape. 

When we condition on the values of the independent variables in our sample, it is clear that the 
sampling distributions of the OLS estimators depend on the underlying distribution of the errors. To 
make the sampling distributions of the $\hat{\beta}_j$ tractable, we now assume that the unobserved error is 
normally distributed in the population. We call this the normality assumption. 
Assumption MLR.6 Normality 


The population error $u$ is independent of the explanatory variables $x_1, x_2, \ldots, x_k$ and is normally 
distributed with zero mean and variance $\sigma^2$: $u \sim \mathrm{Normal}(0, \sigma^2)$. 


Assumption MLR.6 is much stronger than any of our previous assumptions. In 
fact, because $u$ is independent of the $x_j$ under MLR.6, $E(u \mid x_1, \ldots, x_k) = E(u) = 0$ and 
$\mathrm{Var}(u \mid x_1, \ldots, x_k) = \mathrm{Var}(u) = \sigma^2$. Thus, if we make Assumption MLR.6, then we are necessarily 
assuming MLR.4 and MLR.5. To emphasize that we are assuming more than before, we will refer to 
the full set of Assumptions MLR.1 through MLR.6. 

For cross-sectional regression applications, Assumptions MLR.1 through MLR.6 are called 
the classical linear model (CLM) assumptions. Thus, we will refer to the model under these six 
assumptions as the classical linear model. It is best to think of the CLM assumptions as containing 
all of the Gauss-Markov assumptions plus the assumption of a normally distributed error term. 

Under the CLM assumptions, the OLS estimators $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_k$ have a stronger efficiency 
property than they would under the Gauss-Markov assumptions. It can be shown that the OLS estimators are 
the minimum variance unbiased estimators, which means that OLS has the smallest variance among 
unbiased estimators; we no longer have to restrict our comparison to estimators that are linear in the $y_i$. 
This property of OLS under the CLM assumptions is discussed further in Advanced Treatment E. 

A succinct way to summarize the population assumptions of the CLM is 


$$y \mid \mathbf{x} \sim \mathrm{Normal}(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k, \sigma^2),$$


where $\mathbf{x}$ is again shorthand for $(x_1, \ldots, x_k)$. Thus, conditional on $\mathbf{x}$, $y$ has a normal distribution with 
mean linear in $x_1, \ldots, x_k$ and a constant variance. For a single independent variable $x$, this situation is 
shown in Figure 4.1. 


Figure 4.1 The homoskedastic normal distribution with a single explanatory variable. 

[Figure: normal distributions drawn along the line $E(y \mid x) = \beta_0 + \beta_1 x$, with $x$ on the horizontal axis.] 


The argument justifying the normal distribution for the errors usually runs something like this: 
Because u is the sum of many different unobserved factors affecting y, we can invoke the central limit 
theorem (CLT) (see Math Refresher C) to conclude that u has an approximate normal distribution. 
This argument has some merit, but it is not without weaknesses. First, the factors in u can have very 
different distributions in the population (for example, ability and quality of schooling in the error in a 
wage equation). Although the CLT can still hold in such cases, the normal approximation can be poor 
depending on how many factors appear in u and how different their distributions are. 

A more serious problem with the CLT argument is that it assumes that all unobserved factors 
affect y in a separate, additive fashion. Nothing guarantees that this is so. If u is a complicated 
function of the unobserved factors, then the CLT argument does not really apply. 

In any application, whether normality of u can be assumed is really an empirical matter. For 
example, there is no theorem that says wage conditional on educ, exper, and tenure is normally 
distributed. If anything, simple reasoning suggests that the opposite is true: because wage can never 
be less than zero, it cannot, strictly speaking, have a normal distribution. Further, because there are 
minimum wage laws, some fraction of the population earns exactly the minimum wage, which also 
violates the normality assumption. Nevertheless, as a practical matter, we can ask whether the 
conditional wage distribution is “close” to being normal. Past empirical evidence suggests that normality is 
not a good assumption for wages. 

Often, using a transformation, especially taking the log, yields a distribution that is closer to 
normal. For example, something like log(price) tends to have a distribution that looks more normal 
than the distribution of price. Again, this is an empirical issue. We will discuss the consequences of 
nonnormality for statistical inference in Chapter 5. 

There are some applications where MLR.6 is clearly false, as can be demonstrated with simple 
introspection. Whenever y takes on just a few values it cannot have anything close to a normal 
distribution. The dependent variable in Example 3.5 provides a good example. The variable narr86, the 
number of times a young man was arrested in 1986, takes on a small range of integer values and is 
zero for most men. Thus, narr86 is far from being normally distributed. What can be done in these 
cases? As we will see in Chapter 5 (and this is important), nonnormality of the errors is not a serious 
problem with large sample sizes. For now, we just make the normality assumption. 

Normality of the error term translates into normal sampling distributions of the OLS estimators: 


THEOREM 4.1 NORMAL SAMPLING DISTRIBUTIONS 


Under the CLM assumptions MLR.1 through MLR.6, conditional on the sample values of the 
independent variables, 


$$\hat{\beta}_j \sim \mathrm{Normal}[\beta_j, \mathrm{Var}(\hat{\beta}_j)],$$ [4.1] 


where $\mathrm{Var}(\hat{\beta}_j)$ was given in Chapter 3 [equation (3.51)]. Therefore, 


$$(\hat{\beta}_j - \beta_j)/\mathrm{sd}(\hat{\beta}_j) \sim \mathrm{Normal}(0, 1).$$


The proof of (4.1) is not that difficult, given the properties of normally distributed random variables in 
Math Refresher B. Each $\hat{\beta}_j$ can be written as $\hat{\beta}_j = \beta_j + \sum_{i=1}^{n} w_{ij} u_i$, where $w_{ij} = \hat{r}_{ij}/\mathrm{SSR}_j$, $\hat{r}_{ij}$ is the $i$th 
residual from the regression of the $x_j$ on all the other independent variables, and $\mathrm{SSR}_j$ is the sum of squared 
residuals from this regression [see equation (3.65)]. Because the $w_{ij}$ depend only on the independent 
variables, they can be treated as nonrandom. Thus, $\hat{\beta}_j$ is just a linear combination of the errors in the sample 
$\{u_i: i = 1, 2, \ldots, n\}$. Under Assumption MLR.6 (and the random sampling Assumption MLR.2), the 
errors are independent, identically distributed $\mathrm{Normal}(0, \sigma^2)$ random variables. An important fact about 
independent normal random variables is that a linear combination of such random variables is normally 
distributed (see Math Refresher B). This basically completes the proof. In Section 3-3, we showed that 
$E(\hat{\beta}_j) = \beta_j$, and we derived $\mathrm{Var}(\hat{\beta}_j)$ in Section 3-4; there is no need to re-derive these facts. 

The second part of this theorem follows immediately from the fact that when we standardize a normal 
random variable by subtracting off its mean and dividing by its standard deviation, we end up with a 
standard normal random variable. 

Suppose that u is independent of the explanatory variables, and it takes on the values −2, −1, 0, 1, and 2 
with equal probability of 1/5. Does this violate the Gauss-Markov assumptions? Does this violate the 
CLM assumptions? 

The conclusions of Theorem 4.1 can be strengthened. In addition to (4.1), any linear combination 
of the $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_k$ is also normally distributed, and any subset of the $\hat{\beta}_j$ has a joint normal 
distribution. These facts underlie the testing results in the remainder of this chapter. In Chapter 5, we 
will show that the normality of the OLS estimators is still approximately true in large samples even 
without normality of the errors. 
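
A small simulation can make Theorem 4.1 concrete. The sketch below assumes a simple regression with hypothetical parameter values, holds the $x$ values fixed (the theorem conditions on them), redraws normal errors many times, and standardizes the OLS slope by its true standard deviation $\sigma/\sqrt{\mathrm{SST}_x}$. The parameter values, sample size, seed, and number of replications are illustrative assumptions.

```python
import math
import random
import statistics

rng = random.Random(42)
n, beta0, beta1, sigma = 30, 1.0, 0.5, 2.0   # hypothetical population values

# Regressor values are drawn once and then held fixed across replications.
x = [rng.uniform(0, 10) for _ in range(n)]
xbar = sum(x) / n
sst_x = sum((xi - xbar) ** 2 for xi in x)
sd_beta1 = sigma / math.sqrt(sst_x)          # true sd of the OLS slope given x

z = []
for _ in range(5000):
    # Normal errors, as in Assumption MLR.6.
    y = [beta0 + beta1 * xi + rng.gauss(0, sigma) for xi in x]
    b1 = sum((xi - xbar) * yi for xi, yi in zip(x, y)) / sst_x
    z.append((b1 - beta1) / sd_beta1)

mean_z = statistics.mean(z)
var_z = statistics.pvariance(z)
share_within = sum(abs(v) <= 1.96 for v in z) / len(z)
print(mean_z, var_z, share_within)
```

If the standardized slope is standard normal, the simulated mean should be near 0, the variance near 1, and roughly 95 percent of the draws should fall within 1.96 of zero.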


4-2 Testing Hypotheses about a Single Population Parameter: The t Test 


This section covers the very important topic of testing hypotheses about any single parameter in the 
population regression function. The population model can be written as 


$$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + u,$$ [4.2] 


and we assume that it satisfies the CLM assumptions. We know that OLS produces unbiased 
estimators of the $\beta_j$. In this section, we study how to test hypotheses about a particular $\beta_j$. For a full 
understanding of hypothesis testing, one must remember that the $\beta_j$ are unknown features of the population, 
and we will never know them with certainty. Nevertheless, we can hypothesize about the value of $\beta_j$ 
and then use statistical inference to test our hypothesis. 

In order to construct hypotheses tests, we need the following result: 


THEOREM 4.2 t DISTRIBUTION FOR THE STANDARDIZED ESTIMATORS 
Under the CLM assumptions MLR.1 through MLR.6, 


$$(\hat{\beta}_j - \beta_j)/\mathrm{se}(\hat{\beta}_j) \sim t_{n-k-1} = t_{df},$$ [4.3] 


where $k + 1$ is the number of unknown parameters in the population model $y = \beta_0 + \beta_1 x_1 + \cdots + 
\beta_k x_k + u$ ($k$ slope parameters and the intercept $\beta_0$) and $n - k - 1$ is the degrees of freedom (df). 


This result differs from Theorem 4.1 in some notable respects. Theorem 4.1 showed that, under the 
CLM assumptions, $(\hat{\beta}_j - \beta_j)/\mathrm{sd}(\hat{\beta}_j) \sim \mathrm{Normal}(0, 1)$. The t distribution in (4.3) comes from the 
fact that the constant $\sigma$ in $\mathrm{sd}(\hat{\beta}_j)$ has been replaced with the random variable $\hat{\sigma}$. The proof that this 
leads to a t distribution with $n - k - 1$ degrees of freedom is difficult and not especially instructive. 
Essentially, the proof shows that (4.3) can be written as the ratio of the standard normal random 
variable $(\hat{\beta}_j - \beta_j)/\mathrm{sd}(\hat{\beta}_j)$ over the square root of $\hat{\sigma}^2/\sigma^2$. These random variables can be shown to be 
independent, and $(n - k - 1)\hat{\sigma}^2/\sigma^2 \sim \chi^2_{n-k-1}$. The result then follows from the definition of a t random 
variable (see Section B-5 in Math Refresher B). 
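
The practical effect of replacing $\sigma$ with $\hat{\sigma}$ is easiest to see in a very small sample. The sketch below is a simulation under assumed values (simple regression, $n = 8$, so $n - k - 1 = 6$): it standardizes the OLS slope with the estimated standard error and looks at the spread and tails of the result. A $t$ distribution with 6 degrees of freedom has variance $6/4 = 1.5$ and puts noticeably more than 5 percent of its mass beyond $\pm 1.96$. All parameter values, the seed, and the replication count are illustrative assumptions.

```python
import math
import random
import statistics

rng = random.Random(7)
n, beta0, beta1, sigma = 8, 1.0, 0.5, 2.0    # hypothetical values; df = n - 2 = 6

x = [rng.uniform(0, 10) for _ in range(n)]   # held fixed across replications
xbar = sum(x) / n
sst_x = sum((xi - xbar) ** 2 for xi in x)

t_vals = []
for _ in range(20000):
    y = [beta0 + beta1 * xi + rng.gauss(0, sigma) for xi in x]
    ybar = sum(y) / n
    b1 = sum((xi - xbar) * yi for xi, yi in zip(x, y)) / sst_x
    b0 = ybar - b1 * xbar
    ssr = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    sigma_hat = math.sqrt(ssr / (n - 2))     # estimate of sigma with df correction
    se_b1 = sigma_hat / math.sqrt(sst_x)     # standard error uses sigma_hat, not sigma
    t_vals.append((b1 - beta1) / se_b1)

var_t = statistics.pvariance(t_vals)
tail_share = sum(abs(t) > 1.96 for t in t_vals) / len(t_vals)
print(var_t, tail_share)
```

The simulated variance should sit well above 1 and the tail share well above 0.05, which is why critical values from the $t$ distribution, rather than the standard normal, are used when the degrees of freedom are small.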

Theorem 4.2 is important in that it allows us to test hypotheses involving the $\beta_j$. In most 
applications, our primary interest lies in testing the null hypothesis 


$$H_0: \beta_j = 0,$$ [4.4] 

where $j$ corresponds to any of the $k$ independent variables. It is important to understand what (4.4) 
means and to be able to describe this hypothesis in simple language for a particular application. 
Because $\beta_j$ measures the partial effect of $x_j$ on (the expected value of) $y$, after controlling for all other 
independent variables, (4.4) means that, once $x_1, x_2, \ldots, x_{j-1}, x_{j+1}, \ldots, x_k$ have been accounted for, 
$x_j$ has no effect on the expected value of $y$. We cannot state the null hypothesis as “$x_j$ does have a 
partial effect on $y$” because this is true for any value of $\beta_j$ other than zero. Classical testing is suited for 
testing simple hypotheses like (4.4). 


As an example, consider the wage equation 

$$\log(\mathit{wage}) = \beta_0 + \beta_1 \mathit{educ} + \beta_2 \mathit{exper} + \beta_3 \mathit{tenure} + u.$$


The null hypothesis $H_0: \beta_2 = 0$ means that, once education and tenure have been accounted for, the 
number of years in the workforce (exper) has no effect on hourly wage. This is an economically 
interesting hypothesis. If it is true, it implies that a person’s work history prior to the current employment 
does not affect wage. If $\beta_2 > 0$, then prior work experience contributes to productivity, and hence 
to wage. 

You probably remember from your statistics course the rudiments of hypothesis testing for the 
mean from a normal population. (This is reviewed in Math Refresher C.) The mechanics of testing (4.4) 
in the multiple regression context are very similar. The hard part is obtaining the coefficient estimates, 
the standard errors, and the critical values, but most of this work is done automatically by econometrics 
software. Our job is to learn how regression output can be used to test hypotheses of interest. 

The statistic we use to test (4.4) (against any alternative) is called “the” t statistic or “the” t ratio
of $\hat{\beta}_j$ and is defined as

$t_{\hat{\beta}_j} \equiv \hat{\beta}_j/\mathrm{se}(\hat{\beta}_j)$. [4.5]

We have put “the” in quotation marks because, as we will see shortly, a more general form of the
t statistic is needed for testing other hypotheses about $\beta_j$. For now, it is important to know that (4.5)
is suitable only for testing (4.4). For particular applications, it is helpful to index t statistics using the
name of the independent variable; for example, $t_{educ}$ would be the t statistic for $\hat{\beta}_{educ}$.

The t statistic for $\hat{\beta}_j$ is simple to compute given $\hat{\beta}_j$ and its standard error. In fact, most regression
packages do the division for you and report the t statistic along with each coefficient and its standard
error.

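Before discussing how to use (4.5) formally to test $H_0: \beta_j = 0$, it is useful to see why $t_{\hat{\beta}_j}$ has
features that make it reasonable as a test statistic to detect $\beta_j \neq 0$. First, because $\mathrm{se}(\hat{\beta}_j)$ is always
positive, $t_{\hat{\beta}_j}$ has the same sign as $\hat{\beta}_j$: if $\hat{\beta}_j$ is positive, then so is $t_{\hat{\beta}_j}$, and if $\hat{\beta}_j$ is negative, so is $t_{\hat{\beta}_j}$. Second,
for a given value of $\mathrm{se}(\hat{\beta}_j)$, a larger value of $\hat{\beta}_j$ leads to larger values of $t_{\hat{\beta}_j}$. If $\hat{\beta}_j$ becomes more
negative, so does $t_{\hat{\beta}_j}$.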

Because we are testing $H_0: \beta_j = 0$, it is only natural to look at our unbiased estimator of $\beta_j$, $\hat{\beta}_j$,
for guidance. In any interesting application, the point estimate $\hat{\beta}_j$ will never exactly be zero, whether
or not $H_0$ is true. The question is: How far is $\hat{\beta}_j$ from zero? A sample value of $\hat{\beta}_j$ very far from zero
provides evidence against $H_0: \beta_j = 0$. However, we must recognize that there is a sampling error in
our estimate $\hat{\beta}_j$, so the size of $\hat{\beta}_j$ must be weighed against its sampling error. Because the standard
error of $\hat{\beta}_j$ is an estimate of the standard deviation of $\hat{\beta}_j$, $t_{\hat{\beta}_j}$ measures how many estimated standard
deviations $\hat{\beta}_j$ is away from zero. This is precisely what we do in testing whether the mean of a
population is zero, using the standard t statistic from introductory statistics. Values of $t_{\hat{\beta}_j}$ sufficiently far from
zero will result in a rejection of $H_0$. The precise rejection rule depends on the alternative hypothesis
and the chosen significance level of the test.

Determining a rule for rejecting (4.4) at a given significance level (that is, the probability of
rejecting $H_0$ when it is true) requires knowing the sampling distribution of $t_{\hat{\beta}_j}$ when $H_0$ is true. From
Theorem 4.2, we know this to be $t_{n-k-1}$. This is the key theoretical result needed for testing (4.4).

Before proceeding, it is important to remember that we are testing hypotheses about the
population parameters. We are not testing hypotheses about the estimates from a particular sample. Thus, it
never makes sense to state a null hypothesis as “$H_0: \hat{\beta}_1 = 0$” or, even worse, as “$H_0: .237 = 0$” when
the estimate of a parameter is .237 in the sample. We are testing whether the unknown population
value, $\beta_1$, is zero.

Some treatments of regression analysis define the t statistic as the absolute value of (4.5), so that
the t statistic is always positive. This practice has the drawback of making testing against one-sided
alternatives clumsy. Throughout this text, the t statistic always has the same sign as the corresponding
OLS coefficient estimate.


4-2a Testing against One-Sided Alternatives 


To determine a rule for rejecting $H_0$, we need to decide on the relevant alternative hypothesis. First,
consider a one-sided alternative of the form

$H_1: \beta_j > 0$. [4.6]

When we state the alternative as in equation (4.6), we are really saying that the null hypothesis is
$H_0: \beta_j \le 0$. For example, if $\beta_j$ is the coefficient on education in a wage regression, we only care about
detecting that $\beta_j$ is different from zero when $\beta_j$ is actually positive. You may remember from
introductory statistics that the null value that is hardest to reject in favor of (4.6) is $\beta_j = 0$. In other words, if we
reject the null $\beta_j = 0$ then we automatically reject $\beta_j < 0$. Therefore, it suffices to act as if we are testing
$H_0: \beta_j = 0$ against $H_1: \beta_j > 0$, effectively ignoring $\beta_j < 0$, and that is the approach we take in this book.

How should we choose a rejection rule? We must first decide on a significance level (“level”
for short) or the probability of rejecting $H_0$ when it is in fact true. For concreteness, suppose we have
decided on a 5% significance level, as this is the most popular choice. Thus, we are willing to
mistakenly reject $H_0$ when it is true 5% of the time. Now, while $t_{\hat{\beta}_j}$ has a t distribution under $H_0$ (so that it
has zero mean), under the alternative $\beta_j > 0$ the expected value of $t_{\hat{\beta}_j}$ is positive. Thus, we are
looking for a “sufficiently large” positive value of $t_{\hat{\beta}_j}$ in order to reject $H_0: \beta_j = 0$ in favor of $H_1: \beta_j > 0$.
Negative values of $t_{\hat{\beta}_j}$ provide no evidence in favor of $H_1$.

The definition of “sufficiently large,” with a 5% significance level, is the 95th percentile in a
t distribution with $n - k - 1$ degrees of freedom; denote this by $c$. In other words, the rejection rule
is that $H_0$ is rejected in favor of $H_1$ at the 5% significance level if

$t_{\hat{\beta}_j} > c$. [4.7]

By our choice of the critical value, $c$, rejection of $H_0$ will occur for 5% of all random samples when
$H_0$ is true.

The rejection rule in (4.7) is an example of a one-tailed test. To obtain $c$, we only need the
significance level and the degrees of freedom. For example, for a 5% level test and with $n - k - 1 = 28$
degrees of freedom, the critical value is $c = 1.701$. If $t_{\hat{\beta}_j} \le 1.701$, then we fail to reject $H_0$ in favor of
(4.6) at the 5% level. Note that a negative value for $t_{\hat{\beta}_j}$, no matter how large in absolute value, leads to
a failure in rejecting $H_0$ in favor of (4.6). (See Figure 4.2.)

The same procedure can be used with other significance levels. For a 10% level test and if
df = 21, the critical value is $c = 1.323$. For a 1% significance level and if df = 21, $c = 2.518$. All
of these critical values are obtained directly from Table G.2. You should note a pattern in the critical
values: as the significance level falls, the critical value increases, so that we require a larger and larger
value of $t_{\hat{\beta}_j}$ in order to reject $H_0$. Thus, if $H_0$ is rejected at, say, the 5% level, then it is automatically
rejected at the 10% level as well. It makes no sense to reject the null hypothesis at, say, the 5% level
and then to redo the test to determine the outcome at the 10% level.

As the degrees of freedom in the t distribution get large, the t distribution approaches the standard
normal distribution. For example, when $n - k - 1 = 120$, the 5% critical value for the one-sided
alternative (4.7) is 1.658, compared with the standard normal value of 1.645. These are close enough
for practical purposes; for degrees of freedom greater than 120, one can use the standard normal
critical values.
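
The one-sided critical values quoted in this subsection can be checked numerically. The following is a minimal Python sketch that uses only the standard library. The helper names `t_pdf`, `t_cdf`, and `t_quantile` are illustrative assumptions (they do not come from the text or from any particular econometrics package), and the CDF is obtained by simple numerical integration rather than by a production-quality routine.

```python
import math
from statistics import NormalDist


def t_pdf(x: float, df: float) -> float:
    """Density of a t distribution with df degrees of freedom."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_const - (df + 1) / 2 * math.log1p(x * x / df))


def t_cdf(x: float, df: float, steps: int = 1000) -> float:
    """P(T <= x): Simpson's rule on [0, |x|] plus symmetry about zero."""
    a = abs(x)
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    return 0.5 + area if x >= 0 else 0.5 - area


def t_quantile(p: float, df: float) -> float:
    """Invert t_cdf by bisection; adequate for conventional levels and df."""
    lo, hi = -50.0, 50.0
    for _ in range(50):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# The one-sided critical value c is the (1 - level) quantile.
for level, df in [(0.05, 28), (0.10, 21), (0.01, 21), (0.05, 120)]:
    print(f"level={level:.2f}, df={df}: c = {t_quantile(1 - level, df):.3f}")

# Standard normal benchmark used when df is large.
print(f"standard normal 5% value: {NormalDist().inv_cdf(0.95):.3f}")
```

Up to rounding, the printed values should agree with the table values quoted above (1.701, 1.323, 2.518, and 1.658), and the last line gives the standard normal value that the t critical values approach as the degrees of freedom grow.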




Figure 4.2 5% rejection rule for the alternative $H_1: \beta_j > 0$ with 28 df.
[Figure: t density with area = .05 in the right tail; the rejection region lies to the right of the critical value 1.701.]


EXAMPLE 4.1 Hourly Wage Equation

Using the data in WAGE1 gives the estimated equation

$\widehat{\log(wage)} = .284 + .092\, educ + .0041\, exper + .022\, tenure$
(.104) (.007) (.0017) (.003)
$n = 526$, $R^2 = .316$,

where standard errors appear in parentheses below the estimated coefficients. We will follow this
convention throughout the text. This equation can be used to test whether the return to exper,
controlling for educ and tenure, is zero in the population, against the alternative that it is positive. Write this
as $H_0: \beta_{exper} = 0$ versus $H_1: \beta_{exper} > 0$. (In applications, indexing a parameter by its associated
variable name is a nice way to label parameters, because the numerical indices that we use in the general
model are arbitrary and can cause confusion.) Remember that $\beta_{exper}$ denotes the unknown population
parameter. It is nonsense to write “$H_0: .0041 = 0$” or “$H_0: \hat{\beta}_{exper} = 0$.”

Because we have 522 degrees of freedom, we can use the standard normal critical values. The 5%
critical value is 1.645, and the 1% critical value is 2.326. The t statistic for $\hat{\beta}_{exper}$ is

$t_{exper} = .0041/.0017 \approx 2.41$,

and so $\hat{\beta}_{exper}$, or exper, is statistically significant even at the 1% level. We also say that “$\hat{\beta}_{exper}$ is
statistically greater than zero at the 1% significance level.”
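
The arithmetic of this one-sided test can be sketched in a few lines of Python. The variable names are illustrative, and the standard normal quantiles from the standard library stand in for the t critical values, as the text does when the degrees of freedom are large.

```python
from statistics import NormalDist

# Estimate and standard error reported for exper in the wage equation.
coef_exper, se_exper = 0.0041, 0.0017
t_exper = coef_exper / se_exper  # t statistic for H0: beta_exper = 0

# With 522 df, standard normal critical values are used.
crit_5 = NormalDist().inv_cdf(0.95)  # one-sided 5% level
crit_1 = NormalDist().inv_cdf(0.99)  # one-sided 1% level

print(f"t = {t_exper:.2f}")
print(f"reject at 5%: {t_exper > crit_5}, reject at 1%: {t_exper > crit_1}")
```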

The estimated return for another year of experience, holding tenure and education fixed, is not
especially large. For example, adding three more years increases log(wage) by 3(.0041) = .0123, so
wage is only about 1.2% higher. Nevertheless, we have persuasively shown that the partial effect of
experience is positive in the population.

The one-sided alternative that the parameter is less than zero,

$H_1: \beta_j < 0$, [4.8]

also arises in applications. The rejection rule for alternative (4.8) is just the mirror image of the
previous case. Now, the critical value comes from the left tail of the t distribution. In practice, it is easiest to
think of the rejection rule as

$t_{\hat{\beta}_j} < -c$, [4.9]

where $c$ is the critical value for the alternative $H_1: \beta_j > 0$. For simplicity, we always assume $c$
is positive, because this is how critical values are reported in t tables, and so the critical value $-c$ is a
negative number.

For example, if the significance level is 5% and the degrees of freedom is 18, then $c = 1.734$, and so
$H_0: \beta_j = 0$ is rejected in favor of $H_1: \beta_j < 0$ at the 5% level if $t_{\hat{\beta}_j} < -1.734$. It is important to remember
that, to reject $H_0$ against the negative alternative (4.8), we must get a negative t statistic. A positive t ratio,
no matter how large, provides no evidence in favor of (4.8). The rejection rule is illustrated in Figure 4.3.

GOING FURTHER 4.2

Let community loan approval rates be determined by

$apprate = \beta_0 + \beta_1 percmin + \beta_2 avginc + \beta_3 avgwlth + \beta_4 avgdebt + u,$

where percmin is the percentage minority in the community, avginc is average income,
avgwlth is average wealth, and avgdebt is some measure of average debt obligations.
How do you state the null hypothesis that there is no difference in loan rates across
neighborhoods due to racial and ethnic composition, when average income, average wealth,
and average debt have been controlled for? How do you state the alternative that there is
discrimination against minorities in loan approval rates?


Figure 4.3 5% rejection rule for the alternative $H_1: \beta_j < 0$ with 18 df.

EXAMPLE 4.2 Student Performance and School Size


There is much interest in the effect of school size on student performance. (See, for example, The 
New York Times Magazine, 5/28/95.) One claim is that, everything else being equal, students at 
smaller schools fare better than those at larger schools. This hypothesis is assumed to be true even 
after accounting for differences in class sizes across schools. 

The file MEAP93 contains data on 408 high schools in Michigan for the year 1993. We can use 
these data to test the null hypothesis that school size has no effect on standardized test scores against 
the alternative that size has a negative effect. Performance is measured by the percentage of students 
receiving a passing score on the Michigan Educational Assessment Program (MEAP) standardized 
tenth-grade math test (math10). School size is measured by student enrollment (enroll). The null 
hypothesis is $H_0: \beta_{enroll} = 0$, and the alternative is $H_1: \beta_{enroll} < 0$. For now, we will control for two
other factors, average annual teacher compensation (totcomp) and the number of staff per one thou- 
sand students (staff). Teacher compensation is a measure of teacher quality, and staff size is a rough 
measure of how much attention students receive. 

The estimated equation, with standard errors in parentheses, is

$\widehat{math10} = 2.274 + .00046\, totcomp + .048\, staff - .00020\, enroll$
(6.113) (.00010) (.040) (.00022)
$n = 408$, $R^2 = .0541$.


The coefficient on enroll, —.00020, is in accordance with the conjecture that larger schools hamper 
performance: higher enrollment leads to a lower percentage of students with a passing tenth-grade 
math score. (The coefficients on totcomp and staff also have the signs we expect.) The fact that enroll 
has an estimated coefficient different from zero could just be due to sampling error; to be convinced 
of an effect, we need to conduct a t test.

Because $n - k - 1 = 408 - 4 = 404$, we use the standard normal critical value. At the 5% level,
the critical value is $-1.65$; the t statistic on enroll must be less than $-1.65$ to reject $H_0$ at the 5% level.

The t statistic on enroll is $-.00020/.00022 = -.91$, which is larger than $-1.65$: we fail to reject
$H_0$ in favor of $H_1$ at the 5% level. In fact, the 15% critical value is $-1.04$, and because $-.91 > -1.04$,
we fail to reject $H_0$ even at the 15% level. We conclude that enroll is not statistically significant at the
15% level.

The variable totcomp is statistically significant even at the 1% significance level because its
t statistic is 4.6. On the other hand, the t statistic for staff is 1.2, and so we cannot reject $H_0: \beta_{staff} = 0$
against $H_1: \beta_{staff} > 0$ even at the 10% significance level. (The critical value is $c = 1.28$ from the
standard normal distribution.)

To illustrate how changing functional form can affect our conclusions, we also estimate the 
model with all independent variables in logarithmic form. This allows, for example, the school size 
effect to diminish as school size increases. The estimated equation is 


$\widehat{math10} = -207.66 + 21.16\,\log(totcomp) + 3.98\,\log(staff) - 1.29\,\log(enroll)$
(48.70) (4.06) (4.19) (0.69)
$n = 408$, $R^2 = .0654$.


The t statistic on log(enroll) is about $-1.87$; because this is below the 5% critical value $-1.65$, we
reject $H_0: \beta_{\log(enroll)} = 0$ in favor of $H_1: \beta_{\log(enroll)} < 0$ at the 5% level.
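
The left-tail rejection rule (4.9) applied to both school-size equations can be sketched as follows. The function name `reject_left_tail` is an illustrative assumption, and standard normal quantiles are used in place of t critical values because the degrees of freedom (404) are large.

```python
from statistics import NormalDist


def reject_left_tail(t_stat: float, level: float) -> bool:
    """Reject H0 in favor of H1: beta < 0 when t < -c (normal approximation)."""
    c = NormalDist().inv_cdf(1 - level)  # positive critical value
    return t_stat < -c


t_level = -0.00020 / 0.00022  # enroll in the level-level equation
t_log = -1.29 / 0.69          # log(enroll) in the level-log equation

for name, t in [("enroll", t_level), ("log(enroll)", t_log)]:
    for level in (0.05, 0.15):
        print(f"{name}: t = {t:.2f}, reject at {level:.0%}: "
              f"{reject_left_tail(t, level)}")
```

The only difference from the right-tail case is the sign of the comparison: the critical value is still looked up as a positive number and then negated.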

In Chapter 2, we encountered a model in which the dependent variable appeared in its original
form (called level form), while the independent variable appeared in log form (called level-log
model). The interpretation of the parameters is the same in the multiple regression context, except,
of course, that we can give the parameters a ceteris paribus interpretation. Holding totcomp and staff
fixed, we have $\Delta\widehat{math10} = -1.29[\Delta\log(enroll)]$, so that

$\Delta\widehat{math10} \approx -(1.29/100)(\%\Delta enroll) \approx -.013(\%\Delta enroll)$.

Once again, we have used the fact that the change in log(enroll), when multiplied by 100, is
approximately the percentage change in enroll. Thus, if enrollment is 10% higher at a school, math10 is
predicted to be .013(10) = 0.13 percentage points lower (math10 is measured as a percentage).

Which model do we prefer, the one using the level of enroll or the one using log(enroll)? In the 
level-level model, enrollment does not have a statistically significant effect, but in the level-log model 
it does. This translates into a higher R-squared for the level-log model, which means we explain more 
of the variation in math10 by using enroll in logarithmic form (6.5% to 5.4%). The level-log model 
is preferred because it more closely captures the relationship between math10 and enroll. We will say 
more about using R-squared to choose functional form in Chapter 6. 


4-2b Two-Sided Alternatives 


In applications, it is common to test the null hypothesis $H_0: \beta_j = 0$ against a two-sided alternative;
that is,

$H_1: \beta_j \neq 0$. [4.10]

Under this alternative, $x_j$ has a ceteris paribus effect on $y$ without specifying whether the effect is
positive or negative. This is the relevant alternative when the sign of $\beta_j$ is not well determined by theory
(or common sense). Even when we know whether $\beta_j$ is positive or negative under the alternative, a
two-sided test is often prudent. At a minimum, using a two-sided alternative prevents us from
looking at the estimated equation and then basing the alternative on whether $\hat{\beta}_j$ is positive or negative.
Using the regression estimates to help us formulate the null or alternative hypotheses is not allowed
because classical statistical inference presumes that we state the null and alternative about the
population before looking at the data. For example, we should not first estimate the equation relating math
performance to enrollment, note that the estimated effect is negative, and then decide the relevant
alternative is $H_1: \beta_{enroll} < 0$.

Figure 4.4 5% rejection rule for the alternative $H_1: \beta_j \neq 0$ with 25 df.

When the alternative is two-sided, we are interested in the absolute value of the t statistic. The
rejection rule for $H_0: \beta_j = 0$ against (4.10) is

$|t_{\hat{\beta}_j}| > c$, [4.11]

where $|\cdot|$ denotes absolute value and $c$ is an appropriately chosen critical value. To find $c$, we again
specify a significance level, say 5%. For a two-tailed test, $c$ is chosen to make the area in each tail
of the t distribution an equal 2.5%. In other words, $c$ is the 97.5th percentile in the t distribution with
$n - k - 1$ degrees of freedom. When $n - k - 1 = 25$, the 5% critical value for a two-sided test is
$c = 2.060$. Figure 4.4 provides an illustration of this distribution.

When a specific alternative is not stated, it is usually considered to be two-sided. In the remainder 
of this text, the default will be a two-sided alternative, and 5% will be the default significance level. 
When carrying out empirical econometric analysis, it is always a good idea to be explicit about the 
alternative and the significance level. If $H_0$ is rejected in favor of (4.10) at the 5% level, we usually
say that “$x_j$ is statistically significant, or statistically different from zero, at the 5% level.” If $H_0$ is not
rejected, we say that “$x_j$ is statistically insignificant at the 5% level.”


EXAMPLE 4.3 Determinants of College GPA


We use the data in GPA1 to estimate a model explaining college GPA (colGPA), with the average
number of lectures missed per week (skipped) as an additional explanatory variable. The estimated
model is

$\widehat{colGPA} = 1.39 + .412\, hsGPA + .015\, ACT - .083\, skipped$
(.33) (.094) (.011) (.026)
$n = 141$, $R^2 = .234$.


We can easily compute t statistics to see which variables are statistically significant, using a
two-sided alternative in each case. The 5% critical value is about 1.96, because the degrees of freedom
($141 - 4 = 137$) is large enough to use the standard normal approximation. The 1% critical value is
about 2.58.

The t statistic on hsGPA is 4.38, which is significant at very small significance levels. Thus, we
say that “hsGPA is statistically significant at any conventional significance level.” The t statistic on
ACT is 1.36, which is not statistically significant at the 10% level against a two-sided alternative. The
coefficient on ACT is also practically small: a 10-point increase in ACT, which is large, is predicted
to increase colGPA by only .15 points. Thus, the variable ACT is practically, as well as statistically,
insignificant.

The coefficient on skipped has a t statistic of $-.083/.026 = -3.19$, so skipped is statistically
significant at the 1% significance level ($3.19 > 2.58$). This coefficient means that another lecture
missed per week lowers predicted colGPA by about .083. Thus, holding hsGPA and ACT fixed, the
predicted difference in colGPA between a student who misses no lectures per week and a student who
misses five lectures per week is about .42. Remember that this says nothing about specific students;
rather, .42 is the estimated average across a subpopulation of students.
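
The two-sided tests in this example follow rule (4.11) and can be sketched as below. The dictionary layout and the helper `two_sided_critical` are illustrative assumptions, and the standard normal approximation is used for the critical values, as in the text.

```python
from statistics import NormalDist

# (coefficient, standard error) pairs from the estimated colGPA equation.
estimates = {
    "hsGPA": (0.412, 0.094),
    "ACT": (0.015, 0.011),
    "skipped": (-0.083, 0.026),
}


def two_sided_critical(level: float) -> float:
    """Critical value c leaving level/2 in each tail (normal approximation)."""
    return NormalDist().inv_cdf(1 - level / 2)


t_stats = {name: b / se for name, (b, se) in estimates.items()}
for name, t in t_stats.items():
    # Two-sided rule: reject H0 when |t| exceeds the critical value.
    verdicts = [abs(t) > two_sided_critical(lvl) for lvl in (0.10, 0.05, 0.01)]
    print(f"{name}: t = {t:.2f}, significant at 10%/5%/1%: {verdicts}")
```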

In this example, for each variable in the model, we could argue that a one-sided alternative is
appropriate. The variables hsGPA and skipped are very significant using a two-tailed test and have the
signs that we expect, so there is no reason to do a one-tailed test. On the other hand, against a
one-sided alternative ($\beta_j > 0$), ACT is significant at the 10% level but not at the 5% level. This does not
change the fact that the coefficient on ACT is pretty small.

4-2c Testing Other Hypotheses about $\beta_j$


Although $H_0: \beta_j = 0$ is the most common hypothesis, we sometimes want to test whether $\beta_j$ is equal
to some other given constant. Two common examples are $\beta_j = 1$ and $\beta_j = -1$. Generally, if the null
is stated as

$H_0: \beta_j = a_j$, [4.12]

where $a_j$ is our hypothesized value of $\beta_j$, then the appropriate t statistic is

$t = (\hat{\beta}_j - a_j)/\mathrm{se}(\hat{\beta}_j)$.

As before, $t$ measures how many estimated standard deviations $\hat{\beta}_j$ is away from the hypothesized
value of $\beta_j$. The general t statistic is usefully written as

$t = \dfrac{(\text{estimate} - \text{hypothesized value})}{\text{standard error}}$. [4.13]

Under (4.12), this t statistic is distributed as $t_{n-k-1}$ from Theorem 4.2. The usual t statistic is obtained
when $a_j = 0$.

We can use the general t statistic to test against one-sided or two-sided alternatives. For example,
if the null and alternative hypotheses are $H_0: \beta_j = 1$ and $H_1: \beta_j > 1$, then we find the critical value
for a one-sided alternative exactly as before: the difference is in how we compute the t statistic, not in
how we obtain the appropriate $c$. We reject $H_0$ in favor of $H_1$ if $t > c$. In this case, we would say that
“$\hat{\beta}_j$ is statistically greater than one” at the appropriate significance level.


EXAMPLE 4.4 Campus Crime and Enrollment 


Consider a simple model relating the annual number of crimes on college campuses (crime) to student 
enrollment (enroll): 


$\log(crime) = \beta_0 + \beta_1 \log(enroll) + u.$


This is a constant elasticity model, where $\beta_1$ is the elasticity of crime with respect to enrollment. It
is not much use to test $H_0: \beta_1 = 0$, as we expect the total number of crimes to increase as the size
of the campus increases. A more interesting hypothesis to test would be that the elasticity of crime
with respect to enrollment is one: $H_0: \beta_1 = 1$. This means that a 1% increase in enrollment leads to,
on average, a 1% increase in crime. A noteworthy alternative is $H_1: \beta_1 > 1$, which implies that a 1%
increase in enrollment increases campus crime by more than 1%. If $\beta_1 > 1$, then, in a relative sense,
not just an absolute sense, crime is more of a problem on larger campuses. One way to see this is to
take the exponential of the equation:

$crime = \exp(\beta_0)\, enroll^{\beta_1} \exp(u).$

(See Math Refresher A for properties of the natural logarithm and exponential functions.) For $\beta_0 = 0$
and $u = 0$, this equation is graphed in Figure 4.5 for $\beta_1 < 1$, $\beta_1 = 1$, and $\beta_1 > 1$.

We test $\beta_1 = 1$ against $\beta_1 > 1$ using data on 97 colleges and universities in the United States for
the year 1992, contained in the data file CAMPUS. The data come from the FBI’s Uniform Crime
Reports, and the average number of campus crimes in the sample is about 394, while the average
enrollment is about 16,076. The estimated equation (with estimates and standard errors rounded to
two decimal places) is

$\widehat{\log(crime)} = -6.63 + 1.27\,\log(enroll)$
(1.03) (0.11) [4.14]
$n = 97$, $R^2 = .585$.

Figure 4.5 Graph of $crime = enroll^{\beta_1}$ for $\beta_1 < 1$, $\beta_1 = 1$, and $\beta_1 > 1$.

The estimated elasticity of crime with respect to enroll, 1.27, is in the direction of the
alternative $\beta_1 > 1$. But is there enough evidence to conclude that $\beta_1 > 1$? We need to be careful in
testing this hypothesis, especially because the statistical output of standard regression packages
is much more complex than the simplified output reported in equation (4.14). Our first instinct
might be to construct “the” t statistic by taking the coefficient on log(enroll) and dividing it
by its standard error, which is the t statistic reported by a regression package. But this is the
wrong statistic for testing $H_0: \beta_1 = 1$. The correct t statistic is obtained from (4.13): we subtract
the hypothesized value, unity, from the estimate and divide the result by the standard error of
$\hat{\beta}_1$: $t = (1.27 - 1)/.11 = .27/.11 \approx 2.45$. The one-sided 5% critical value for a t distribution with
$97 - 2 = 95$ df is about 1.66 (using df = 120), so we clearly reject $\beta_1 = 1$ in favor of $\beta_1 > 1$ at
the 5% level. In fact, the 1% critical value is about 2.37, and so we reject the null in favor of the
alternative at even the 1% level.

We should keep in mind that this analysis holds no other factors constant, so the elasticity of 
1.27 is not necessarily a good estimate of ceteris paribus effect. It could be that larger enrollments are 
correlated with other factors that cause higher crime: larger schools might be located in higher crime 
areas. We could control for this by collecting data on crime rates in the local city. 


For a two-sided alternative, for example $H_0: \beta_j = -1$, $H_1: \beta_j \neq -1$, we still compute the t statistic
as in (4.13): $t = (\hat{\beta}_j + 1)/\mathrm{se}(\hat{\beta}_j)$ (notice how subtracting $-1$ means adding 1). The rejection rule
is the usual one for a two-sided test: reject $H_0$ if $|t| > c$, where $c$ is a two-tailed critical value. If $H_0$
is rejected, we say that “$\hat{\beta}_j$ is statistically different from negative one” at the appropriate significance
level.

EXAMPLE 4.5 Housing Prices and Air Pollution


For a sample of 506 communities in the Boston area, we estimate a model relating median housing 
price (price) in the community to various community characteristics: nox is the amount of nitrogen 
oxide in the air, in parts per million; dist is a weighted distance of the community from five employ- 
ment centers, in miles; rooms is the average number of rooms in houses in the community; and stratio 
is the average student-teacher ratio of schools in the community. The population model is 


$\log(price) = \beta_0 + \beta_1 \log(nox) + \beta_2 \log(dist) + \beta_3 rooms + \beta_4 stratio + u.$


Thus, $\beta_1$ is the elasticity of price with respect to nox. We wish to test $H_0: \beta_1 = -1$ against the
alternative $H_1: \beta_1 \neq -1$. The t statistic for doing this test is $t = (\hat{\beta}_1 + 1)/\mathrm{se}(\hat{\beta}_1)$.
Using the data in HPRICE2, the estimated model is

$\widehat{\log(price)} = 11.08 - .954\,\log(nox) - .134\,\log(dist) + .255\, rooms - .052\, stratio$
(0.32) (.117) (.043) (.019) (.006)
$n = 506$, $R^2 = .581$.


The slope estimates all have the anticipated signs. Each coefficient is statistically different from
zero at very small significance levels, including the coefficient on log(nox). But we do not want
to test that $\beta_1 = 0$. The null hypothesis of interest is $H_0: \beta_1 = -1$, with corresponding t statistic
$(-.954 + 1)/.117 = .393$. There is little need to look in the t table for a critical value when the
t statistic is this small: the estimated elasticity is not statistically different from $-1$ even at very large
significance levels. Controlling for the factors we have included, there is little evidence that the
elasticity is different from $-1$.
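
The general t statistic (4.13) used in the last two examples can be written as a small helper. The function name `t_statistic` and its argument names are illustrative assumptions. The last call shows the statistic for a zero null, which is what a package typically reports by default and which is not the one needed here.

```python
def t_statistic(estimate: float, std_error: float,
                hypothesized: float = 0.0) -> float:
    """General t statistic (4.13): (estimate - hypothesized value) / se."""
    return (estimate - hypothesized) / std_error


# Campus crime: H0: beta_1 = 1 against H1: beta_1 > 1.
t_crime = t_statistic(1.27, 0.11, hypothesized=1.0)

# Housing prices: H0: beta_1 = -1 against H1: beta_1 != -1.
t_nox = t_statistic(-0.954, 0.117, hypothesized=-1.0)

# Default statistic for H0: beta_1 = 0, shown only for contrast.
t_nox_default = t_statistic(-0.954, 0.117)

print(f"campus crime: t = {t_crime:.2f}")
print(f"housing, H0 beta_1 = -1: t = {t_nox:.3f}")
print(f"housing, H0 beta_1 = 0:  t = {t_nox_default:.2f}")
```

The contrast in the housing example is the point: the same coefficient is far from zero in standard-error units yet very close to the hypothesized value of $-1$.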


4-2d Computing p-Values for t Tests


So far, we have talked about how to test hypotheses using a classical approach: after stating the alter- 
native hypothesis, we choose a significance level, which then determines a critical value. Once the 
critical value has been identified, the value of the t statistic is compared with the critical value, and the
null is either rejected or not rejected at the given significance level. 

Even after deciding on the appropriate alternative, there is a component of arbitrariness to the 
classical approach, which results from having to choose a significance level ahead of time. Different 
researchers prefer different significance levels, depending on the particular application. There is no 
“correct” significance level. 

Committing to a significance level ahead of time can hide useful information about the outcome of 
a hypothesis test. For example, suppose that we wish to test the null hypothesis that a parameter is zero 
against a two-sided alternative, and with 40 degrees of freedom we obtain a t statistic equal to 1.85.
The null hypothesis is not rejected at the 5% level, because the t statistic is less than the two-tailed
critical value of c = 2.021. A researcher whose agenda is not to reject the null could simply report this 
outcome along with the estimate: the null hypothesis is not rejected at the 5% level. Of course, if the 
t statistic, or the coefficient and its standard error, are reported, then we can also determine that the null 
hypothesis would be rejected at the 10% level, because the 10% critical value is c = 1.684. 

Rather than testing at different significance levels, it is more informative to answer the following
question: given the observed value of the t statistic, what is the smallest significance level at which
the null hypothesis would be rejected? This level is known as the p-value for the test (see Math
Refresher C). In the previous example, we know the p-value is greater than .05, because the null is
not rejected at the 5% level, and we know that the p-value is less than .10, because the null is rejected
at the 10% level. We obtain the actual p-value by computing the probability that a t random variable,
with 40 df, is larger than 1.85 in absolute value. That is, the p-value is the significance level of the test
when we use the value of the test statistic, 1.85 in the above example, as the critical value for the test.
This p-value is shown in Figure 4.6.

Figure 4.6 Obtaining the p-value against a two-sided alternative, when t = 1.85 and df = 40.

Because a p-value is a probability, its value is always between zero and one. In order to compute
p-values, we either need extremely detailed printed tables of the t distribution (which is not very
practical) or a computer program that computes areas under the probability density function of the
t distribution. Most modern regression packages have this capability. Some packages compute p-values
routinely with each OLS regression, but only for certain hypotheses. If a regression package reports
a p-value along with the standard OLS output, it is almost certainly the p-value for testing the null
hypothesis $H_0: \beta_j = 0$ against the two-sided alternative. The p-value in this case is

$P(|T| > |t|)$, [4.15]

where, for clarity, we let $T$ denote a t distributed random variable with $n - k - 1$ degrees of freedom
and let $t$ denote the numerical value of the test statistic.

The p-value nicely summarizes the strength or weakness of the empirical evidence against the
null hypothesis. Perhaps its most useful interpretation is the following: the p-value is the probability
of observing a t statistic as extreme as we did if the null hypothesis is true. This means that small
p-values are evidence against the null; large p-values provide little evidence against $H_0$. For
example, if the p-value = .50 (reported always as a decimal, not a percentage), then we would observe a
value of the t statistic as extreme as we did in 50% of all random samples when the null hypothesis is
true; this is pretty weak evidence against $H_0$.

In the example with df = 40 and t = 1.85, the p-value is computed as

p-value $= P(|T| > 1.85) = 2P(T > 1.85) = 2(.0359) = .0718$,

where $P(T > 1.85)$ is the area to the right of 1.85 in a t distribution with 40 df. (This value was
computed using the econometrics package Stata; it is not available in Table G.2.) This means that, if the
null hypothesis is true, we would observe an absolute value of the t statistic as large as 1.85 about
7.2 percent of the time. This provides some evidence against the null hypothesis, but we would not
reject the null at the 5% significance level.

The previous example illustrates that once the p-value has been computed, a classical test can be 
carried out at any desired level. If $\alpha$ denotes the significance level of the test (in decimal form), then 
$H_0$ is rejected if p-value < $\alpha$; otherwise, $H_0$ is not rejected at the $100\cdot\alpha$% level. 
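
The calculation in the df = 40 example can be sketched without a regression package. The following is only an illustration, not how Stata computes the value: it approximates the tail area by integrating the t density with Simpson's rule. The helper names (t_pdf, t_upper_tail, two_sided_p_value) and the number of integration steps are our own assumptions.

```python
import math

def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def t_upper_tail(t, df, steps=2000):
    """P(T > t) for t >= 0: one half minus the Simpson-rule area over [0, t]."""
    if t < 0:
        raise ValueError("t must be nonnegative")
    h = t / steps  # steps is even, as Simpson's rule requires
    area = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - area * h / 3

def two_sided_p_value(t, df):
    """p-value for H0: beta_j = 0 against the two-sided alternative, as in (4.15)."""
    return 2 * t_upper_tail(abs(t), df)

p = two_sided_p_value(1.85, 40)
print(f"two-sided p-value: {p:.4f}")

# Once the p-value is known, the test can be carried out at any level alpha.
for alpha in (0.10, 0.05, 0.01):
    decision = "reject H0" if p < alpha else "fail to reject H0"
    print(f"alpha = {alpha:.2f}: {decision}")
```

With t = 1.85 and 40 df, the rule rejects at the 10% level but not at the 5% or 1% levels, in line with the discussion above.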

Computing p-values for one-sided alternatives is also quite simple. Suppose, for example, that we 
test $H_0: \beta_j = 0$ against $H_1: \beta_j > 0$. If $\hat{\beta}_j < 0$, then computing a p-value is not important: we know that 
the p-value is greater than .50, which will never cause us to reject $H_0$ in favor of $H_1$. If $\hat{\beta}_j > 0$, then 
t > 0 and the p-value is just the probability that a random t variable with the appropriate df exceeds 
the value t. Some regression packages only compute p-values for two-sided alternatives. But it is 
simple to obtain the one-sided p-value: just divide the two-sided p-value by 2. 

If the alternative is $H_1: \beta_j < 0$, it makes sense to compute a p-value if $\hat{\beta}_j < 0$ (and hence t < 0): 
p-value = P(T < t) = P(T > |t|) because the t distribution is symmetric about zero. Again, this can 
be obtained as one-half of the p-value for the two-tailed test. 
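
The conversion from a two-sided to a one-sided p-value can be written as a small function. This is a sketch under our own naming (one_sided_p_value and its arguments are hypothetical); the numbers reuse the df = 40 example, where t = 1.85 and the two-sided p-value is .0718.

```python
def one_sided_p_value(two_sided_p, t_stat, alternative):
    """Convert a two-sided p-value into a one-sided one.

    alternative is "greater" for H1: beta_j > 0 or "less" for H1: beta_j < 0.
    When the sign of the t statistic agrees with the alternative, the
    one-sided p-value is half the two-sided one; otherwise it is one minus
    that half, which is always greater than .50.
    """
    if alternative not in ("greater", "less"):
        raise ValueError("alternative must be 'greater' or 'less'")
    agrees = t_stat > 0 if alternative == "greater" else t_stat < 0
    half = two_sided_p / 2
    return half if agrees else 1 - half

print(one_sided_p_value(0.0718, 1.85, "greater"))  # half of the two-sided value
print(one_sided_p_value(0.0718, 1.85, "less"))     # above .50: never rejects
```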

Because you will quickly become familiar with the magnitudes of t statistics that lead to statistical 
significance, especially for large sample sizes, it is not always crucial to report p-values for t statistics. 
But it does not hurt to report them. Further, when we discuss F testing in Section 4-5, we will see that it is 
important to compute p-values, because critical values for F tests are not so easily memorized. 


GOING FURTHER 4.3 
Suppose you estimate a regression model and obtain $\hat{\beta}_1 = .56$ and p-value = .086 for 
testing $H_0: \beta_1 = 0$ against $H_1: \beta_1 \neq 0$. What is the p-value for testing $H_0: \beta_1 = 0$ against 
$H_1: \beta_1 > 0$? 


4-2e A Reminder on the Language of Classical Hypothesis Testing 


When $H_0$ is not rejected, we prefer to use the language “we fail to reject $H_0$ at the x% level,” rather 
than “$H_0$ is accepted at the x% level.” We can use Example 4.5 to illustrate why the former statement 
is preferred. In this example, the estimated elasticity of price with respect to nox is $-.954$, and the 
t statistic for testing $H_0: \beta_{nox} = -1$ is t = .393; therefore, we cannot reject $H_0$. But there are many 
other values for $\beta_{nox}$ (more than we can count) that cannot be rejected. For example, the t statistic 
for $H_0: \beta_{nox} = -.9$ is $(-.954 + .9)/.117 = -.462$, and so this null is not rejected either. Clearly 
$\beta_{nox} = -1$ and $\beta_{nox} = -.9$ cannot both be true, so it makes no sense to say that we “accept” either of 
these hypotheses. All we can say is that the data do not allow us to reject either of these hypotheses at 
the 5% significance level. 
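
The point can be checked numerically: the same estimate and standard error yield a small t statistic for more than one hypothesized value. In this sketch the function name t_stat is our own, and the two-sided 5% critical value of 1.96 is an assumption based on the standard normal approximation for a large sample.

```python
def t_stat(estimate, se, null_value=0.0):
    """t statistic for H0: beta = null_value."""
    return (estimate - null_value) / se

beta_hat, se = -0.954, 0.117  # elasticity estimate and standard error from Example 4.5
critical = 1.96               # assumed 5% two-sided critical value (normal approximation)

for null in (-1.0, -0.9):
    t = t_stat(beta_hat, se, null)
    verdict = "reject" if abs(t) > critical else "fail to reject"
    print(f"H0: beta_nox = {null}: t = {t:.3f} -> {verdict}")
```

Neither null is rejected, yet both cannot be true, which is why "fail to reject" is the preferred wording.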


4-2f Economic, or Practical, versus Statistical Significance 


Because we have emphasized statistical significance throughout this section, now is a good time to 
remember that we should pay attention to the magnitude of the coefficient estimates in addition to the 
size of the t statistics. The statistical significance of a variable $x_j$ is determined entirely by the size of 
$t_{\hat{\beta}_j}$, whereas the economic significance or practical significance of a variable is related to the size 
(and sign) of $\hat{\beta}_j$. 

Recall that the t statistic for testing $H_0: \beta_j = 0$ is defined by dividing the estimate by its standard 
error: $t_{\hat{\beta}_j} = \hat{\beta}_j/\mathrm{se}(\hat{\beta}_j)$. Thus, $t_{\hat{\beta}_j}$ can indicate statistical significance either because $\hat{\beta}_j$ is “large” 
or because $\mathrm{se}(\hat{\beta}_j)$ is “small.” It is important in practice to distinguish between these reasons for statistically 
significant t statistics. Too much focus on statistical significance can lead to the false conclusion 
that a variable is “important” for explaining y even though its estimated effect is modest. 
EXAMPLE 4.6 Participation Rates in 401(k) Plans 


In Example 3.3, we used the data on 401(k) plans to estimate a model describing participation rates in 
terms of the firm’s match rate and the age of the plan. We now include a measure of firm size, the total 
number of firm employees (totemp). The estimated equation is 


$\widehat{prate} = 80.29 + 5.44\,mrate + .269\,age - .00013\,totemp$ 
(0.78) (0.52) (.045) (.00004) 
n = 1,534, $R^2$ = .100. 


The smallest t statistic in absolute value is that on the variable totemp: $t = -.00013/.00004 = -3.25$, 
and this is statistically significant at very small significance levels. (The two-tailed p-value for this 
t statistic is about .001.) Thus, all of the variables are statistically significant at rather small significance 
levels. 

How big, in a practical sense, is the coefficient on totemp? Holding mrate and age fixed, if a firm 
grows by 10,000 employees, the participation rate falls by 10,000(.00013) = 1.3 percentage points. 
This is a huge increase in number of employees with only a modest effect on the participation rate. 
Thus, although firm size does affect the participation rate, the effect is not practically very large. 
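
The contrast between statistical and practical significance in this example can be reproduced from the reported numbers alone. In this sketch the two-tailed p-value uses the standard normal approximation, which is our assumption; it is reasonable here because the sample has more than 1,500 observations.

```python
import math

coef, se = -0.00013, 0.00004  # totemp coefficient and standard error

t = coef / se
# Two-tailed p-value under the normal approximation: 2 * P(Z > |t|) = erfc(|t| / sqrt(2)).
p_two_sided = math.erfc(abs(t) / math.sqrt(2))

# Practical size: change in prate (percentage points) for 10,000 more employees.
effect = 10_000 * coef

print(f"t = {t:.2f}, two-tailed p-value = {p_two_sided:.4f}")
print(f"effect of 10,000 more employees: {effect:.1f} percentage points")
```

The t statistic is large in magnitude, yet a very large change in firm size moves the participation rate by only about 1.3 percentage points.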


The previous example shows that it is especially important to interpret the magnitude of the coefficient, 
in addition to looking at t statistics, when working with large samples. With large sample 
sizes, parameters can be estimated very precisely: standard errors are often quite small relative to the 
coefficient estimates, which usually results in statistical significance. 

Some researchers insist on using smaller significance levels as the sample size increases, partly 
as a way to offset the fact that standard errors are getting smaller. For example, if we feel comfortable 
with a 5% level when n is a few hundred, we might use the 1% level when n is a few thousand. Using 
a smaller significance level means that economic and statistical significance are more likely to coin- 
cide, but there are no guarantees: in the previous example, even if we use a significance level as small 
as .1% (one-tenth of 1%), we would still conclude that totemp is statistically significant. 

Many researchers are also willing to entertain larger significance levels in applications with 
small sample sizes, reflecting the fact that it is harder to find significance with smaller sample sizes. 
(Smaller sample sizes lead to less precise estimators, and the critical values are larger in magnitude, 
two factors that make it harder to find statistical significance.) Unfortunately, one’s willingness to 
consider higher significance levels can depend on one’s underlying agenda. 


EXAMPLE 4.7 Effect of Job Training on Firm Scrap Rates 


The scrap rate for a manufacturing firm is the number of defective items—products that must be 
discarded—out of every 100 produced. Thus, for a given number of items produced, a decrease in the 
scrap rate reflects higher worker productivity. 

We can use the scrap rate to measure the effect of worker training on productivity. Using the data 
in JTRAIN, but only for the year 1987 and for nonunionized firms, we obtain the following estimated 
equation: 


$\widehat{\log(scrap)} = 12.46 - .029\,hrsemp - .962\,\log(sales) + .761\,\log(employ)$ 
(5.69) (.023) (.453) (.407) 
n = 29, $R^2$ = .262. 


The variable hrsemp is annual hours of training per employee, sales is annual firm sales (in dollars), 
and employ is the number of firm employees. For 1987, the average scrap rate in the sample is about 
4.6 and the average of hrsemp is about 8.9. 
The main variable of interest is hrsemp. One more hour of training per employee lowers log(scrap) 
by .029, which means the scrap rate is about 2.9% lower. Thus, if hrsemp increases by 5—each 
employee is trained 5 more hours per year—the scrap rate is estimated to fall by 5(2.9) = 14.5%. 
This seems like a reasonably large effect, but whether the additional training is worthwhile to the firm 
depends on the cost of training and the benefits from a lower scrap rate. We do not have the numbers 
needed to do a cost-benefit analysis, but the estimated effect seems nontrivial. 

What about the statistical significance of the training variable? The t statistic on hrsemp is 
$-.029/.023 = -1.26$, and now you probably recognize this as not being large enough in magnitude 
to conclude that hrsemp is statistically significant at the 5% level. In fact, with 29 − 4 = 25 degrees 
of freedom for the one-sided alternative, $H_1: \beta_{hrsemp} < 0$, the 5% critical value is about $-1.71$. Thus, 
using a strict 5% level test, we must conclude that hrsemp is not statistically significant, even using a 
one-sided alternative. 


Because the sample size is pretty small, we might be more liberal with the significance level. The 
10% critical value is $-1.32$, and so hrsemp is almost significant against the one-sided alternative at 
the 10% level. The p-value is easily computed as $P(T_{25} < -1.26) = .110$. This may be a low enough 
p-value to conclude that the estimated effect of training is not just due to sampling error, but opinions 
would legitimately differ on whether a one-sided p-value of .11 is sufficiently small. 
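
The numbers in this example can be traced step by step. The sketch below is only illustrative: the one-sided p-value is approximated by integrating the t density with Simpson's rule, and the helper names (t_pdf, t_lower_tail) are our own, not functions from any regression package.

```python
import math

def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def t_lower_tail(t, df, steps=2000):
    """P(T < t) for t <= 0, using symmetry and Simpson's rule on [0, |t|]."""
    width = abs(t)
    h = width / steps
    area = t_pdf(0.0, df) + t_pdf(width, df)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - area * h / 3

coef, se, df = -0.029, 0.023, 29 - 4  # hrsemp coefficient, standard error, degrees of freedom

t_stat = coef / se
pct_change = 100 * coef * 5            # approximate % change in scrap for 5 more hours
p_one_sided = t_lower_tail(round(t_stat, 2), df)

print(f"t = {t_stat:.2f}")
print(f"5 more hours of training: about {pct_change:.1f}% change in the scrap rate")
print(f"significant at 5% (critical value -1.71)?  {t_stat < -1.71}")
print(f"significant at 10% (critical value -1.32)? {t_stat < -1.32}")
print(f"one-sided p-value: {p_one_sided:.3f}")
```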


Remember that large standard errors can also be a result of multicollinearity (high correlation 
among some of the independent variables), even if the sample size seems fairly large. As we dis- 
cussed in Section 3-4, there is not much we can do about this problem other than to collect more data 
or change the scope of the analysis by dropping or combining certain independent variables. As in 
the case of a small sample size, it can be hard to precisely estimate partial effects when some of the 
explanatory variables are highly correlated. (Section 4-5 contains an example.) 

We end this section with some guidelines for discussing the economic and statistical significance 
of a variable in a multiple regression model: 


1. Check for statistical significance. If the variable is statistically significant, discuss the magnitude 
of the coefficient to get an idea of its practical or economic importance. This latter step can require 
some care, depending on how the independent and dependent variables appear in the equation. (In 
particular, what are the units of measurement? Do the variables appear in logarithmic form?) 


2. If a variable is not statistically significant at the usual levels (10%, 5%, or 1%), you might still 
ask if the variable has the expected effect on y and whether that effect is practically large. If it is 
large, you should compute a p-value for the t statistic. For small sample sizes, you can sometimes 
make a case for p-values as large as .20 (but there are no hard rules). With large p-values, that is, 
small t statistics, we are treading on thin ice because the practically large estimates may be due to 
sampling error: a different random sample could result in a very different estimate. 


3. It is common to find variables with small t statistics that have the “wrong” sign. For practical purposes, 
these can be ignored: we conclude that the variables are statistically insignificant. A significant 
variable that has the unexpected sign and a practically large effect is much more troubling 
and difficult to resolve. One must usually think more about the model and the nature of the data 
to solve such problems. Often, a counterintuitive, significant estimate results from the omission 
of a key variable or from one of the important problems we will discuss in Chapters 9 and 15. 


4-3 Confidence Intervals 


Under the CLM assumptions, we can easily construct a confidence interval (CI) for the population 
parameter $\beta_j$. Confidence intervals are also called interval estimates because they provide a range of 
likely values for the population parameter, and not just a point estimate. 
Using the fact that $(\hat{\beta}_j - \beta_j)/\mathrm{se}(\hat{\beta}_j)$ has a t distribution with $n - k - 1$ degrees of freedom 
[see (4.3)], simple manipulation leads to a CI for the unknown $\beta_j$: a 95% confidence interval, given by 


$\hat{\beta}_j \pm c\cdot\mathrm{se}(\hat{\beta}_j)$, [4.16] 


where the constant c is the 97.5th percentile in a $t_{n-k-1}$ distribution. More precisely, the lower and 
upper bounds of the confidence interval are given by 


$\underline{\beta}_j \equiv \hat{\beta}_j - c\cdot\mathrm{se}(\hat{\beta}_j)$ 


and 


$\bar{\beta}_j \equiv \hat{\beta}_j + c\cdot\mathrm{se}(\hat{\beta}_j)$, 
respectively. 

At this point, it is useful to review the meaning of a confidence interval. If random samples were 
obtained over and over again, with $\underline{\beta}_j$ and $\bar{\beta}_j$ computed each time, then the (unknown) population value 
$\beta_j$ would lie in the interval $(\underline{\beta}_j, \bar{\beta}_j)$ for 95% of the samples. Unfortunately, for the single sample that 
we use to construct the CI, we do not know whether $\beta_j$ is actually contained in the interval. We hope 
we have obtained a sample that is one of the 95% of all samples where the interval estimate contains 
$\beta_j$, but we have no guarantee. 

Constructing a confidence interval is very simple when using current computing technology. Three 
quantities are needed: $\hat{\beta}_j$, $\mathrm{se}(\hat{\beta}_j)$, and c. The coefficient estimate and its standard error are reported by 
any regression package. To obtain the value c, we must know the degrees of freedom, $n - k - 1$, and 
the level of confidence (95% in this case). Then, the value for c is obtained from the $t_{n-k-1}$ distribution. 

As an example, for df = $n - k - 1$ = 25, a 95% confidence interval for any $\beta_j$ is given by 
$[\hat{\beta}_j - 2.06\cdot\mathrm{se}(\hat{\beta}_j),\ \hat{\beta}_j + 2.06\cdot\mathrm{se}(\hat{\beta}_j)]$. 

When $n - k - 1 > 120$, the $t_{n-k-1}$ distribution is close enough to normal to use the 97.5th percentile 
in a standard normal distribution for constructing a 95% CI: $\hat{\beta}_j \pm 1.96\cdot\mathrm{se}(\hat{\beta}_j)$. In fact, when 
$n - k - 1 > 50$, the value of c is so close to 2 that we can use a simple rule of thumb for a 95% confidence 
interval: $\hat{\beta}_j$ plus or minus two of its standard errors. For small degrees of freedom, the exact 
percentiles should be obtained from the t tables. 

It is easy to construct confidence intervals for any other level of confidence. For example, 
a 90% CI is obtained by choosing c to be the 95th percentile in the $t_{n-k-1}$ distribution. When 
df = $n - k - 1$ = 25, c = 1.71, and so the 90% CI is $\hat{\beta}_j \pm 1.71\cdot\mathrm{se}(\hat{\beta}_j)$, which is necessarily narrower 
than the 95% CI. For a 99% CI, c is the 99.5th percentile in the $t_{25}$ distribution. When df = 25, 
the 99% CI is roughly $\hat{\beta}_j \pm 2.79\cdot\mathrm{se}(\hat{\beta}_j)$, which is inevitably wider than the 95% CI. 
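
The way the interval widens with the confidence level can be seen by applying (4.16) with the three critical values quoted above for df = 25. In this sketch the estimate and standard error are hypothetical numbers chosen purely for illustration, and confidence_interval is our own helper name.

```python
def confidence_interval(estimate, se, c):
    """Interval estimate +/- c * se, as in (4.16)."""
    return estimate - c * se, estimate + c * se

# Critical values for df = 25 quoted in the text.
critical_values = {"90%": 1.71, "95%": 2.06, "99%": 2.79}

beta_hat, se = 0.50, 0.10  # hypothetical estimate and standard error

widths = {}
for level, c in critical_values.items():
    lower, upper = confidence_interval(beta_hat, se, c)
    widths[level] = upper - lower
    print(f"{level} CI: ({lower:.3f}, {upper:.3f}), width {widths[level]:.3f}")
```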

Many modern regression packages save us from doing any calculations by reporting a 95% CI 
along with each coefficient and its standard error. After a confidence interval is constructed, it is easy 
to carry out two-tailed hypotheses tests. If the null hypothesis is $H_0: \beta_j = a_j$, then $H_0$ is rejected against 
$H_1: \beta_j \neq a_j$ at (say) the 5% significance level if, and only if, $a_j$ is not in the 95% confidence interval. 


EXAMPLE 4.8 Model of R&D Expenditures 


Economists studying industrial organization are interested in the relationship between firm size— 
often measured by annual sales—and spending on research and development (R&D). Typically, a 
constant elasticity model is used. One might also be interested in the ceteris paribus effect of the profit 
margin—that is, profits as a percentage of sales—on R&D spending. Using the data in RDCHEM on 
32 U.S. firms in the chemical industry, we estimate the following equation (with standard errors in 
parentheses below the coefficients): 


$\widehat{\log(rd)} = -4.38 + 1.084\,\log(sales) + .0217\,profmarg$ 
(.47) (.060) (.0128) 
n = 32, $R^2$ = .918. 
The estimated elasticity of R&D spending with respect to firm sales is 1.084, so that, holding profit margin 
fixed, a 1% increase in sales is associated with a 1.084% increase in R&D spending. (Incidentally, 
R&D and sales are both measured in millions of dollars, but their units of measurement have no effect on 
the elasticity estimate.) We can construct a 95% confidence interval for the sales elasticity once we note 
that the estimated model has $n - k - 1 = 32 - 2 - 1 = 29$ degrees of freedom. From Table G.2, we 
find the 97.5th percentile in a $t_{29}$ distribution: c = 2.045. Thus, the 95% confidence interval for $\beta_{\log(sales)}$ 
is $1.084 \pm .060(2.045)$, or about (.961, 1.21). That zero is well outside this interval is hardly surprising: 
we expect R&D spending to increase with firm size. More interesting is that unity is included in 
the 95% confidence interval for $\beta_{\log(sales)}$, which means that we cannot reject $H_0: \beta_{\log(sales)} = 1$ against 
$H_1: \beta_{\log(sales)} \neq 1$ at the 5% significance level. In other words, the estimated R&D-sales elasticity is not 
statistically different from 1 at the 5% level. (The estimate is not practically different from 1, either.) 

The estimated coefficient on profmarg is also positive, and the 95% confidence interval for the 
population parameter, $\beta_{profmarg}$, is $.0217 \pm .0128(2.045)$, or about ($-.0045$, .0479). In this case, zero is 
included in the 95% confidence interval, so we fail to reject $H_0: \beta_{profmarg} = 0$ against $H_1: \beta_{profmarg} \neq 0$ 
at the 5% level. Nevertheless, the t statistic is about 1.70, which gives a two-sided p-value of about .10, 
and so we would conclude that profmarg is statistically significant at the 10% level against the two-sided 
alternative, or at the 5% level against the one-sided alternative $H_1: \beta_{profmarg} > 0$. Plus, the economic 
size of the profit margin coefficient is not trivial: holding sales fixed, a one percentage point 
increase in profmarg is estimated to increase R&D spending by $100(.0217) \approx 2.2\%$. A complete 
analysis of this example goes beyond simply stating whether a particular value, zero in this case, is or 
is not in the 95% confidence interval. 
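
The two intervals in Example 4.8, and the link between a 95% CI and a two-sided 5% test, can be reproduced from the reported estimates. This is a sketch with our own helper names (confidence_interval and rejects_null are hypothetical); the critical value 2.045 is the one taken from Table G.2 in the example.

```python
def confidence_interval(estimate, se, c):
    """Interval estimate +/- c * se."""
    return estimate - c * se, estimate + c * se

def rejects_null(ci, a):
    """Two-sided test of H0: beta_j = a at the level matching the CI: reject iff a lies outside."""
    lower, upper = ci
    return not (lower <= a <= upper)

c = 2.045  # 97.5th percentile of the t distribution with 29 df

ci_sales = confidence_interval(1.084, 0.060, c)
ci_profmarg = confidence_interval(0.0217, 0.0128, c)

print("log(sales):", tuple(round(b, 3) for b in ci_sales))
print("  reject H0: beta = 0?", rejects_null(ci_sales, 0.0))
print("  reject H0: beta = 1?", rejects_null(ci_sales, 1.0))
print("profmarg:  ", tuple(round(b, 4) for b in ci_profmarg))
print("  reject H0: beta = 0?", rejects_null(ci_profmarg, 0.0))
```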


You should remember that a confidence interval is only as good as the underlying assump- 
tions used to construct it. If we have omitted important factors that are correlated with the explana- 
tory variables, then the coefficient estimates are not reliable: OLS is biased. If heteroskedasticity 
is present (for instance, in the previous example, if the variance of log(rd) depends on any of the 
explanatory variables), then the standard error is not valid as an estimate of $\mathrm{sd}(\hat{\beta}_j)$ (as we discussed 
in Section 3-4), and the confidence interval computed using these standard errors will not truly be a 
95% CI. We have also used the normality assumption on the errors in obtaining these CIs, but, as we 
will see in Chapter 5, this is not as important for applications involving hundreds of observations. 


4-4 Testing Hypotheses about a Single Linear 
Combination of the Parameters 


The previous two sections have shown how to use classical hypothesis testing or confidence intervals 
to test hypotheses about a single $\beta_j$ at a time. In applications, we must often test hypotheses involving 
more than one of the population parameters. In this section, we show how to test a single hypothesis 
involving more than one of the $\beta_j$. Section 4-5 shows how to test multiple hypotheses. 

To illustrate the general approach, we will consider a simple model to compare the returns to edu- 
cation at junior colleges and four-year colleges; for simplicity, we refer to the latter as “universities.” 
[Kane and Rouse (1995) provide a detailed analysis of the returns to two- and four-year colleges.] The 
population includes working people with a high school degree, and the model is 


$\log(wage) = \beta_0 + \beta_1 jc + \beta_2 univ + \beta_3 exper + u$, [4.17] 
where 


jc = number of years attending a two-year college. 
univ = number of years at a four-year college. 
exper = months in the workforce. 

Note that any combination of junior college and four-year college is allowed, including jc = 0 and 
univ = 0. 

The hypothesis of interest is whether one year at a junior college is worth one year at a university: 
this is stated as 


$H_0: \beta_1 = \beta_2$. [4.18] 


Under $H_0$, another year at a junior college and another year at a university lead to the same ceteris 
paribus percentage increase in wage. For the most part, the alternative of interest is one-sided: a year 
at a junior college is worth less than a year at a university. This is stated as 


$H_1: \beta_1 < \beta_2$. [4.19] 


The hypotheses in (4.18) and (4.19) concern two parameters, $\beta_1$ and $\beta_2$, a situation we have not 
faced yet. We cannot simply use the individual t statistics for $\hat{\beta}_1$ and $\hat{\beta}_2$ to test $H_0$. However, conceptually, 
there is no difficulty in constructing a t statistic for testing (4.18). To do so, we rewrite the 
null and alternative as $H_0: \beta_1 - \beta_2 = 0$ and $H_1: \beta_1 - \beta_2 < 0$, respectively. The t statistic is based on 
whether the estimated difference $\hat{\beta}_1 - \hat{\beta}_2$ is sufficiently less than zero to warrant rejecting (4.18) in 
favor of (4.19). To account for the sampling error in our estimators, we standardize this difference by 
dividing by the standard error: 


$t = \dfrac{\hat{\beta}_1 - \hat{\beta}_2}{\mathrm{se}(\hat{\beta}_1 - \hat{\beta}_2)}$. [4.20] 


Once we have the t statistic in (4.20), testing proceeds as before. We choose a significance level 
for the test and, based on the df, obtain a critical value. Because the alternative is of the form in (4.19), 
the rejection rule is of the form $t < -c$, where c is a positive value chosen from the appropriate t distribution. 
Or we compute the t statistic and then compute the p-value (see Section 4-2). 

The only thing that makes testing the equality of two different parameters more difficult than 
testing about a single $\beta_j$ is obtaining the standard error in the denominator of (4.20). Obtaining the 
numerator is trivial once we have performed the OLS regression. Using the data in TWOYEAR, 
which comes from Kane and Rouse (1995), we estimate equation (4.17): 


$\widehat{\log(wage)} = 1.472 + .0667\,jc + .0769\,univ + .0049\,exper$ 
(.021) (.0068) (.0023) (.0002) [4.21] 
n = 6,763, $R^2$ = .222. 


It is clear from (4.21) that jc and univ have both economically and statistically significant effects on 
wage. This is certainly of interest, but we are more concerned about testing whether the estimated difference 
in the coefficients is statistically significant. The difference is estimated as $\hat{\beta}_1 - \hat{\beta}_2 = -.0102$, 
so the return to a year at a junior college is about one percentage point less than a year at a university. 
Economically, this is not a trivial difference. The difference of $-.0102$ is the numerator of the t statistic 
in (4.20). 

Unfortunately, the regression results in equation (4.21) do not contain enough information to obtain 
the standard error of $\hat{\beta}_1 - \hat{\beta}_2$. It might be tempting to claim that $\mathrm{se}(\hat{\beta}_1 - \hat{\beta}_2) = \mathrm{se}(\hat{\beta}_1) - \mathrm{se}(\hat{\beta}_2)$, but 
this is not true. In fact, if we reversed the roles of $\hat{\beta}_1$ and $\hat{\beta}_2$, we would wind up with a negative standard 
error of the difference using the difference in standard errors. Standard errors must always be 
positive because they are estimates of standard deviations. Although the standard error of the difference 
$\hat{\beta}_1 - \hat{\beta}_2$ certainly depends on $\mathrm{se}(\hat{\beta}_1)$ and $\mathrm{se}(\hat{\beta}_2)$, it does so in a somewhat complicated way. To 
find $\mathrm{se}(\hat{\beta}_1 - \hat{\beta}_2)$, we first obtain the variance of the difference. Using the results on variances in Math 
Refresher B, we have 


$\mathrm{Var}(\hat{\beta}_1 - \hat{\beta}_2) = \mathrm{Var}(\hat{\beta}_1) + \mathrm{Var}(\hat{\beta}_2) - 2\,\mathrm{Cov}(\hat{\beta}_1, \hat{\beta}_2)$. [4.22] 
Observe carefully how the two variances are added together, and twice the covariance is then subtracted. 
The standard deviation of $\hat{\beta}_1 - \hat{\beta}_2$ is just the square root of (4.22), and, because $[\mathrm{se}(\hat{\beta}_1)]^2$ is 
an unbiased estimator of $\mathrm{Var}(\hat{\beta}_1)$, and similarly for $[\mathrm{se}(\hat{\beta}_2)]^2$, we have 


$\mathrm{se}(\hat{\beta}_1 - \hat{\beta}_2) = \{[\mathrm{se}(\hat{\beta}_1)]^2 + [\mathrm{se}(\hat{\beta}_2)]^2 - 2s_{12}\}^{1/2}$, [4.23] 


where $s_{12}$ denotes an estimate of $\mathrm{Cov}(\hat{\beta}_1, \hat{\beta}_2)$. We have not displayed a formula for $\mathrm{Cov}(\hat{\beta}_1, \hat{\beta}_2)$. Some 
regression packages have features that allow one to obtain $s_{12}$, in which case one can compute the 
standard error in (4.23) and then the t statistic in (4.20). Advanced Treatment E shows how to use 
matrix algebra to obtain $s_{12}$. 
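
Formula (4.23) is easy to evaluate once an estimate of the covariance is available. In the sketch below, the two standard errors come from (4.21), but the covariance values are hypothetical: the text does not report $s_{12}$, so they are used only to show how the formula behaves. The function name se_difference is also our own.

```python
import math

def se_difference(se1, se2, s12):
    """Standard error of the difference of two estimators, following (4.23)."""
    variance = se1 ** 2 + se2 ** 2 - 2 * s12
    if variance < 0:
        raise ValueError("inputs imply a negative variance")
    return math.sqrt(variance)

se_jc, se_univ = 0.0068, 0.0023  # standard errors reported in (4.21)

# HYPOTHETICAL covariance estimates; the text does not report s12.
for s12 in (0.0, 2.0e-6):
    print(f"s12 = {s12:.1e}: se of difference = {se_difference(se_jc, se_univ, s12):.5f}")

# The tempting shortcut se1 - se2 is not the standard error of the difference,
# and it changes sign when the roles of the two estimators are reversed.
print("se_jc - se_univ =", round(se_jc - se_univ, 4))
print("se_univ - se_jc =", round(se_univ - se_jc, 4))
```

Unlike the shortcut, (4.23) gives the same answer whichever estimator is listed first.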

Some of the more sophisticated econometrics programs include special commands that can be 
used for testing hypotheses about linear combinations. Here, we cover an approach that is simple to 
compute in virtually any statistical package. Rather than trying to compute $\mathrm{se}(\hat{\beta}_1 - \hat{\beta}_2)$ from (4.23), 
it is much easier to estimate a different model that directly delivers the standard error of interest. 
Define a new parameter as the difference between $\beta_1$ and $\beta_2$: $\theta_1 = \beta_1 - \beta_2$. Then, we want to test 


$H_0: \theta_1 = 0$ against $H_1: \theta_1 < 0$. [4.24] 


The t statistic in (4.20) in terms of $\hat{\theta}_1$ is just $t = \hat{\theta}_1/\mathrm{se}(\hat{\theta}_1)$. The challenge is finding $\mathrm{se}(\hat{\theta}_1)$. 

We can do this by rewriting the model so that $\theta_1$ appears directly on one of the independent 
variables. Because $\theta_1 = \beta_1 - \beta_2$, we can also write $\beta_1 = \theta_1 + \beta_2$. Plugging this into (4.17) and rearranging 
gives the equation 


$\log(wage) = \beta_0 + (\theta_1 + \beta_2)jc + \beta_2 univ + \beta_3 exper + u$ 


$= \beta_0 + \theta_1 jc + \beta_2(jc + univ) + \beta_3 exper + u$. [4.25] 


The key insight is that the parameter we are interested in testing hypotheses about, $\theta_1$, now multiplies 
the variable jc. The intercept is still $\beta_0$, and exper still shows up as being multiplied by $\beta_3$. More 
importantly, there is a new variable multiplying $\beta_2$, namely jc + univ. Thus, if we want to directly 
estimate $\theta_1$ and obtain the standard error of $\hat{\theta}_1$, then we must construct the new variable jc + univ 
and include it in the regression model in place of univ. In this example, the new variable has a natural 
interpretation: it is total years of college, so define totcoll = jc + univ and write (4.25) as 


$\log(wage) = \beta_0 + \theta_1 jc + \beta_2 totcoll + \beta_3 exper + u$. [4.26] 


The parameter $\beta_1$ has disappeared from the model, while $\theta_1$ appears explicitly. This model is really 
just a different way of writing the original model. The only reason we have defined this new model is 
that, when we estimate it, the coefficient on jc is $\hat{\theta}_1$ and, more importantly, $\mathrm{se}(\hat{\theta}_1)$ is reported along 
with the estimate. The t statistic that we want is the one reported by any regression package on the 
variable jc (not the variable totcoll). 

When we do this with the 6,763 observations used earlier, the result is 


$\widehat{\log(wage)} = 1.472 - .0102\,jc + .0769\,totcoll + .0049\,exper$ 
(.021) (.0069) (.0023) (.0002) [4.27] 
n = 6,763, $R^2$ = .222. 


The only number in this equation that we could not get from (4.21) is the standard error for the estimate 
$-.0102$, which is .0069. The t statistic for testing (4.18) is $-.0102/.0069 = -1.48$. Against 
the one-sided alternative (4.19), the p-value is about .070, so there is some, but not strong, evidence 
against (4.18). 

The intercept and slope estimate on exper, along with their standard errors, are the same as in 
(4.21). This fact must be true, and it provides one way of checking whether the transformed equation 
has been properly estimated. The coefficient on the new variable, totcoll, is the same as the coefficient 
on univ in (4.21), and the standard error is also the same. We know that this must happen by compar- 
ing (4.17) and (4.25). 
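
The equivalence between the original and the transformed regression can be verified on artificial data. The sketch below does not use the TWOYEAR data: it simulates a sample with made-up coefficients, fits both regressions with a small pure-Python OLS routine (our own helper, ols, written only for this illustration), and checks that the coefficient and standard error on jc in the transformed model match $\hat{\beta}_1 - \hat{\beta}_2$ and formula (4.23).

```python
import math
import random

def ols(columns, y):
    """OLS of y on an intercept plus the given columns.

    Returns (coefficients, estimated covariance matrix of the coefficients).
    """
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    # Invert X'X by Gauss-Jordan elimination on the augmented matrix [X'X | I].
    A = [[sum(row[r] * row[c] for row in X) for c in range(k)]
         + [1.0 if c == r else 0.0 for c in range(k)]
         for r in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        scale = A[col][col]
        A[col] = [a / scale for a in A[col]]
        for r in range(k):
            if r != col:
                factor = A[r][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    inv = [row[k:] for row in A]
    xty = [sum(X[i][r] * y[i] for i in range(n)) for r in range(k)]
    beta = [sum(inv[r][c] * xty[c] for c in range(k)) for r in range(k)]
    resid = [y[i] - sum(b * x for b, x in zip(beta, X[i])) for i in range(n)]
    sigma2 = sum(e * e for e in resid) / (n - k)  # SSR divided by the degrees of freedom
    cov = [[sigma2 * inv[r][c] for c in range(k)] for r in range(k)]
    return beta, cov

# Simulated data with made-up parameter values (not the TWOYEAR sample).
random.seed(0)
n = 400
jc = [float(random.randint(0, 2)) for _ in range(n)]
univ = [float(random.randint(0, 4)) for _ in range(n)]
exper = [random.uniform(0, 200) for _ in range(n)]
lwage = [1.5 + 0.07 * j + 0.08 * u + 0.005 * e + random.gauss(0, 0.4)
         for j, u, e in zip(jc, univ, exper)]
totcoll = [j + u for j, u in zip(jc, univ)]

b, cov_b = ols([jc, univ, exper], lwage)     # original model, as in (4.17)
g, cov_g = ols([jc, totcoll, exper], lwage)  # transformed model, as in (4.26)

se_formula = math.sqrt(cov_b[1][1] + cov_b[2][2] - 2 * cov_b[1][2])  # (4.23)
se_direct = math.sqrt(cov_g[1][1])           # reported directly on jc

print("beta1_hat - beta2_hat:", b[1] - b[2], " theta1_hat:", g[1])
print("se from (4.23):       ", se_formula, " se on jc:  ", se_direct)
```

The intercept, the coefficient on exper, and the coefficient on totcoll (compared with univ) also agree across the two fits, which is the check described above.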
It is quite simple to compute a 95% confidence interval for $\theta_1 = \beta_1 - \beta_2$. Using the standard 
normal approximation, the CI is obtained as usual: $\hat{\theta}_1 \pm 1.96\,\mathrm{se}(\hat{\theta}_1)$, which in this case leads to 
$-.0102 \pm .0135$. 

The strategy of rewriting the model so that it contains the parameter of interest works in all cases 
and is easy to implement. (See Computer Exercises C1 and C3 for other examples.) 
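
The t statistic, one-sided p-value, and 95% confidence interval for $\theta_1$ follow directly from the estimate and standard error in (4.27). In this sketch the p-value uses the standard normal approximation, which is our assumption; with 6,763 observations the t and normal distributions are practically identical.

```python
import math

theta_hat, se = -0.0102, 0.0069  # coefficient and standard error on jc in (4.27)

t = theta_hat / se
# One-sided p-value for H1: theta1 < 0 under the normal approximation: P(Z < t).
p_one_sided = 0.5 * math.erfc(-t / math.sqrt(2))

half_width = 1.96 * se
ci = (theta_hat - half_width, theta_hat + half_width)

print(f"t = {t:.2f}, one-sided p-value = {p_one_sided:.3f}")
print(f"95% CI: {theta_hat} +/- {half_width:.4f} = ({ci[0]:.4f}, {ci[1]:.4f})")
```

The interval contains zero, consistent with failing to reject (4.18) against a two-sided alternative at the 5% level.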


4-5 Testing Multiple Linear Restrictions: The F Test 


The t statistic associated with any OLS coefficient can be used to test whether the corresponding 
unknown parameter in the population is equal to any given constant (which is usually, but not always, 
zero). We have just shown how to test hypotheses about a single linear combination of the $\beta_j$ by rearranging 
the equation and running a regression using transformed variables. But so far, we have only 
covered hypotheses involving a single restriction. Frequently, we wish to test multiple hypotheses 
about the underlying parameters $\beta_0, \beta_1, \ldots, \beta_k$. We begin with the leading case of testing whether a 
set of independent variables has no partial effect on a dependent variable. 


4-5a Testing Exclusion Restrictions 


We already know how to test whether a particular variable has no partial effect on the dependent variable: 
use the t statistic. Now, we want to test whether a group of variables has no effect on the dependent 
variable. More precisely, the null hypothesis is that a set of variables has no effect on y, once 
another set of variables has been controlled. 

As an illustration of why testing significance of a group of variables is useful, we consider the 
following model that explains major league baseball players’ salaries: 


$\log(salary) = \beta_0 + \beta_1 years + \beta_2 gamesyr + \beta_3 bavg + \beta_4 hrunsyr + \beta_5 rbisyr + u$, [4.28] 


where salary is the 1993 total salary, years is years in the league, gamesyr is average games played 
per year, bavg is career batting average (for example, bavg = 250), hrunsyr is home runs per year, 
and rbisyr is runs batted in per year. Suppose we want to test the null hypothesis that, once years in 
the league and games per year have been controlled for, the statistics measuring performance—bavg, 
hrunsyr, and rbisyr—have no effect on salary. Essentially, the null hypothesis states that productivity 
as measured by baseball statistics has no effect on salary. 

In terms of the parameters of the model, the null hypothesis is stated as 


$H_0: \beta_3 = 0, \beta_4 = 0, \beta_5 = 0$. [4.29] 


The null (4.29) constitutes three exclusion restrictions: if (4.29) is true, then bavg, hrunsyr, and 
rbisyr have no effect on log(salary) after years and gamesyr have been controlled for and therefore 
should be excluded from the model. This is an example of a set of multiple restrictions because we 
are putting more than one restriction on the parameters in (4.28); we will see more general examples 
of multiple restrictions later. A test of multiple restrictions is called a multiple hypotheses test or a 
joint hypotheses test. 

What should be the alternative to (4.29)? If what we have in mind is that “performance statistics 
matter, even after controlling for years in the league and games per year,” then the appropriate alterna- 
tive is simply 


$H_1: H_0$ is not true. [4.30] 


The alternative (4.30) holds if at least one of $\beta_3$, $\beta_4$, or $\beta_5$ is different from zero. (Any or all could be 
different from zero.) The test we study here is constructed to detect any violation of $H_0$. It is also valid 
when the alternative is something like $H_1: \beta_3 > 0$, or $\beta_4 > 0$, or $\beta_5 > 0$, but it will not be the best 
possible test under such alternatives. We do not have the space or statistical background necessary to 
cover tests that have more power under multiple one-sided alternatives. 

How should we proceed in testing (4.29) against (4.30)? It is tempting to test (4.29) by using the 
t statistics on the variables bavg, hrunsyr, and rbisyr to determine whether each variable is individually 
significant. This option is not appropriate. A particular t statistic tests a hypothesis that puts no 
restrictions on the other parameters. Besides, we would have three outcomes to contend with, one for 
each t statistic. What would constitute rejection of (4.29) at, say, the 5% level? Should all three or only 
one of the three t statistics be required to be significant at the 5% level? These are hard questions, and 
fortunately we do not have to answer them. Furthermore, using separate t statistics to test a multiple 
hypothesis like (4.29) can be very misleading. We need a way to test the exclusion restrictions jointly. 

To illustrate these issues, we estimate equation (4.28) using the data in MLB1. This gives 


$\widehat{\log(salary)} = 11.19 + .0689\,years + .0126\,gamesyr$ 
(0.29) (.0121) (.0026) 
$+ .00098\,bavg + .0144\,hrunsyr + .0108\,rbisyr$ [4.31] 
(.00110) (.0161) (.0072) 


n = 353, SSR = 183.186, $R^2$ = .6278, 


where SSR is the sum of squared residuals. (We will use this later.) We have left several terms after the 
decimal in SSR and R-squared to facilitate future comparisons. Equation (4.31) reveals that, whereas 
years and gamesyr are statistically significant, none of the variables bavg, hrunsyr, and rbisyr has a
statistically significant t statistic against a two-sided alternative, at the 5% significance level. (The
t statistic on rbisyr is the closest to being significant; its two-sided p-value is .134.) Thus, based on the
three t statistics, it appears that we cannot reject $H_0$.

This conclusion turns out to be wrong. To see this, we must derive a test of multiple restrictions 
whose distribution is known and tabulated. The sum of squared residuals now turns out to provide a 
very convenient basis for testing multiple hypotheses. We will also show how the R-squared can be 
used in the special case of testing for exclusion restrictions. 

Knowing the sum of squared residuals in (4.31) tells us nothing about the truth of the hypoth- 
esis in (4.29). However, the factor that will tell us something is how much the SSR increases when 
we drop the variables bavg, hrunsyr, and rbisyr from the model. Remember that, because the OLS 
estimates are chosen to minimize the sum of squared residuals, the SSR always increases when vari- 
ables are dropped from the model; this is an algebraic fact. The question is whether this increase is 
large enough, relative to the SSR in the model with all of the variables, to warrant rejecting the null 
hypothesis. 

The model without the three variables in question is simply 


$$\log(\mathit{salary}) = \beta_0 + \beta_1\,\mathit{years} + \beta_2\,\mathit{gamesyr} + u.$$ [4.32]


In the context of hypothesis testing, equation (4.32) is the restricted model for testing (4.29); model 
(4.28) is called the unrestricted model. The restricted model always has fewer parameters than the 
unrestricted model. 

When we estimate the restricted model using the data in MLB1, we obtain 


$$\widehat{\log(\mathit{salary})} = \underset{(.11)}{11.22} + \underset{(.0125)}{.0713}\,\mathit{years} + \underset{(.0013)}{.0202}\,\mathit{gamesyr}$$ [4.33]
$$n = 353,\quad SSR = 198.311,\quad R^2 = .5971.$$

As we surmised, the SSR from (4.33) is greater than the SSR from (4.31), and the R-squared from the 
restricted model is less than the R-squared from the unrestricted model. What we need to decide is 
whether the increase in the SSR in going from the unrestricted model to the restricted model (183.186
to 198.311) is large enough to warrant rejection of (4.29). As with all testing, the answer depends on
the significance level of the test. But we cannot carry out the test at a chosen significance level until we
have a statistic whose distribution is known, and can be tabulated, under $H_0$. Thus, we need a way to
combine the information in the two SSRs to obtain a test statistic with a known distribution under $H_0$.

Because it is no more difficult, we might as well derive the test for the general case. Write the 
unrestricted model with k independent variables as 


$$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + u;$$ [4.34]


the number of parameters in the unrestricted model is k + 1. (Remember to add one for the inter- 
cept.) Suppose that we have q exclusion restrictions to test: that is, the null hypothesis states that q 
of the variables in (4.34) have zero coefficients. For notational simplicity, assume that it is the last q 
variables in the list of independent variables: $x_{k-q+1}, \ldots, x_k$. (The order of the variables, of course, is
arbitrary and unimportant.) The null hypothesis is stated as 


$$H_0: \beta_{k-q+1} = 0, \ldots, \beta_k = 0,$$ [4.35]


which puts q exclusion restrictions on the model (4.34). The alternative to (4.35) is simply that it is
false; this means that at least one of the parameters listed in (4.35) is different from zero. When we
impose the restrictions under $H_0$, we are left with the restricted model:


$$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_{k-q} x_{k-q} + u.$$ [4.36]


In this subsection, we assume that both the unrestricted and restricted models contain an intercept,
because that is the case most widely encountered in practice.

Now, for the test statistic itself. Earlier, we suggested that looking at the relative increase in the
SSR when moving from the unrestricted to the restricted model should be informative for testing the
hypothesis (4.35). The F statistic (or F ratio) is defined by


$$F \equiv \frac{(SSR_r - SSR_{ur})/q}{SSR_{ur}/(n - k - 1)},$$ [4.37]


where $SSR_r$ is the sum of squared residuals from the restricted model and $SSR_{ur}$ is the sum of
squared residuals from the unrestricted model.

You should immediately notice that, because $SSR_r$ can be no smaller than $SSR_{ur}$, the F statistic is
always nonnegative (and almost always strictly positive). Thus, if you compute a negative F statistic,
then something is wrong; the order of the SSRs in the numerator of F has usually been reversed. Also,
the SSR in the denominator of F is the SSR from the unrestricted model. The easiest way to remember
where the SSRs appear is to think of F as measuring the relative increase in SSR when moving from
the unrestricted to the restricted model.


GOING FURTHER 4.4

Consider relating individual performance on a standardized test, score, to a variety of other variables.
School factors include average class size, per-student expenditures, average teacher compensation, and
total school enrollment. Other variables specific to the student are family income, mother’s education,
father’s education, and number of siblings. The model is

$$\mathit{score} = \beta_0 + \beta_1\,\mathit{classize} + \beta_2\,\mathit{expend} + \beta_3\,\mathit{tchcomp} + \beta_4\,\mathit{enroll} + \beta_5\,\mathit{faminc} + \beta_6\,\mathit{motheduc} + \beta_7\,\mathit{fatheduc} + \beta_8\,\mathit{siblings} + u.$$

State the null hypothesis that student-specific variables have no effect on standardized test performance
once school-related factors have been controlled for. What are k and q for this example? Write
down the restricted version of the model.


The difference in SSRs in the numerator of F is divided by q, which is the number of restrictions
imposed in moving from the unrestricted to the restricted model (q independent variables are dropped).
Therefore, we can write


$$q = \text{numerator degrees of freedom} = df_r - df_{ur},$$ [4.38]


which also shows that q is the difference in degrees of freedom between the restricted and unrestricted
models. (Recall that df = number of observations − number of estimated parameters.)
Because the restricted model has fewer parameters, and each model is estimated using the same n
observations, $df_r$ is always greater than $df_{ur}$.

The SSR in the denominator of F is divided by the degrees of freedom in the unrestricted model: 


$$n - k - 1 = \text{denominator degrees of freedom} = df_{ur}.$$ [4.39]


In fact, the denominator of F is just the unbiased estimator of $\sigma^2 = \mathrm{Var}(u)$ in the unrestricted model.
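
As a quick check of this interpretation, the denominator can be worked out for the baseball regression using only the numbers reported in (4.31), where n = 353 and k = 5. The result is rounded:

$$\hat{\sigma}^2 = \frac{SSR_{ur}}{n - k - 1} = \frac{183.186}{353 - 5 - 1} = \frac{183.186}{347} \approx .528.$$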

In a particular application, computing the F statistic is easier than wading through the somewhat 
cumbersome notation used to describe the general case. We first obtain the degrees of freedom in the 
unrestricted model, $df_{ur}$. Then, we count how many variables are excluded in the restricted model; this
is q. The SSRs are reported with every OLS regression, and so forming the F statistic is simple.

In the major league baseball salary regression, n = 353, and the full model (4.28) contains six 
parameters. Thus, $n - k - 1 = df_{ur} = 353 - 6 = 347$. The restricted model (4.32) contains three
fewer independent variables than (4.28), and so q = 3. Thus, we have all of the ingredients to com- 
pute the F statistic; we hold off doing so until we know what to do with it. 

To use the F statistic, we must know its sampling distribution under the null in order to choose 
critical values and rejection rules. It can be shown that, under $H_0$ (and assuming the CLM assumptions
hold), F is distributed as an F random variable with $(q, n - k - 1)$ degrees of freedom. We write
this as


$$F \sim F_{q,\,n-k-1}.$$


The distribution of $F_{q,\,n-k-1}$ is readily tabulated and available in statistical tables (see Table G.3) and,
even more importantly, in statistical software. 

We will not derive the F distribution because the mathematics is very involved. Basically, it can 
be shown that equation (4.37) is actually the ratio of two independent chi-square random variables, 
divided by their respective degrees of freedom. The numerator chi-square random variable has q 
degrees of freedom, and the chi-square in the denominator has n — k — 1 degrees of freedom. This is 
the definition of an F distributed random variable (see Math Refresher B). 
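
The ratio-of-chi-squares description can be checked by simulation. The following is a minimal sketch using only the Python standard library; the function names, the seed, and the number of draws are our own illustrative choices. It relies on the fact that a chi-square random variable with v degrees of freedom is a gamma random variable with shape v/2 and scale 2, and it compares the simulated 95th percentile for (3, 60) degrees of freedom with the tabulated critical value 2.76.

```python
import random


def simulate_f_draws(q, df_denom, n_draws, seed=12345):
    """Draw from F(q, df_denom) as a ratio of independent scaled chi-squares."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        # A chi-square with v degrees of freedom is Gamma(shape=v/2, scale=2).
        chi2_num = rng.gammavariate(q / 2.0, 2.0)
        chi2_den = rng.gammavariate(df_denom / 2.0, 2.0)
        # Divide each chi-square by its own degrees of freedom, then take the ratio.
        draws.append((chi2_num / q) / (chi2_den / df_denom))
    return draws


def empirical_quantile(values, p):
    """Simple order-statistic estimate of the p-th quantile."""
    ordered = sorted(values)
    index = min(int(p * len(ordered)), len(ordered) - 1)
    return ordered[index]


draws = simulate_f_draws(q=3, df_denom=60, n_draws=100_000)
crit_95 = empirical_quantile(draws, 0.95)
share_above = sum(d > 2.76 for d in draws) / len(draws)

# The simulated 95th percentile should be near 2.76, and the share of draws
# above 2.76 should be near .05; both are subject to simulation error.
print(round(crit_95, 2), round(share_above, 3))
```

Because the draws are random, the printed values only approximate the table entries; more draws tighten the approximation.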

It is pretty clear from the definition of F that we will reject $H_0$ in favor of $H_1$ when F is sufficiently
“large.” How large depends on our chosen significance level. Suppose that we have decided on
a 5% level test. Let c be the 95th percentile in the $F_{q,\,n-k-1}$ distribution. This critical value depends on
q (the numerator df) and n — k — 1 (the denominator df). It is important to keep the numerator and 
denominator degrees of freedom straight. 

The 10%, 5%, and 1% critical values for the F distribution are given in Table G.3. The rejection 
rule is simple. Once c has been obtained, we reject $H_0$ in favor of $H_1$ at the chosen significance level if


$$F > c.$$ [4.40]


With a 5% significance level, q = 3, and n — k — 1 = 60, the critical value is c = 2.76. We would 
reject $H_0$ at the 5% level if the computed value of the F statistic exceeds 2.76. The 5% critical value
and rejection region are shown in Figure 4.7. For the same degrees of freedom, the 1% critical value 
is 4.13. 

In most applications, the numerator degrees of freedom (q) will be notably smaller than the 
denominator degrees of freedom (n — k — 1). Applications where n — k — 1 is small are unlikely 
to be successful because the parameters in the unrestricted model will probably not be precisely esti- 
mated. When the denominator df reaches about 120, the F distribution is no longer sensitive to it. 
(This is entirely analogous to the t distribution being well approximated by the standard normal distribution
as the df gets large.) Thus, there is an entry in the table for the denominator df = ∞, and this
is what we use with large samples (because n − k − 1 is then large). A similar statement holds for a
very large numerator df, but this rarely occurs in applications. 

If $H_0$ is rejected, then we say that $x_{k-q+1}, \ldots, x_k$ are jointly statistically significant (or just
jointly significant) at the appropriate significance level. This test alone does not allow us to say
which of the variables has a partial effect on y; they may all affect y or maybe only one affects y.
If the null is not rejected, then the variables are jointly insignificant, which often justifies dropping
them from the model.

[Figure 4.7: The 5% critical value and rejection region in an $F_{3,60}$ distribution. The area to the left of the critical value 2.76 is .95; the rejection region lies to its right.]

For the major league baseball example with three numerator degrees of freedom and 347 denominator
degrees of freedom, the 5% critical value is 2.60, and the 1% critical value is 3.78. We reject $H_0$
at the 1% level if F is above 3.78; we reject at the 5% level if F is above 2.60. 

We are now in a position to test the hypothesis that we began this section with: after controlling
for years and gamesyr, the variables bavg, hrunsyr, and rbisyr have no effect on players’ salaries.
In practice, it is easiest to first compute $(SSR_r - SSR_{ur})/SSR_{ur}$ and to multiply the result by
$(n - k - 1)/q$; the reason the formula is stated as in (4.37) is that it makes it easier to keep the
numerator and denominator degrees of freedom straight. Using the SSRs in (4.31) and (4.33), we have


$$F = \frac{(198.311 - 183.186)}{183.186} \cdot \frac{347}{3} \approx 9.55.$$


This number is well above the 1% critical value in the F distribution with 3 and 347 degrees of free- 
dom, and so we soundly reject the hypothesis that bavg, hrunsyr, and rbisyr have no effect on salary. 
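
The calculation is easy to script. The sketch below is only an illustration: the function name `f_stat_ssr` is our own, and the inputs are the values reported in (4.31) and (4.33). It also guards against the most common mistake mentioned earlier, reversing the two SSRs.

```python
def f_stat_ssr(ssr_r, ssr_ur, q, df_ur):
    """SSR form of the F statistic, as in (4.37)."""
    if ssr_r < ssr_ur:
        # The restricted SSR can never be smaller than the unrestricted SSR.
        raise ValueError("SSR_r is smaller than SSR_ur; check the order of the SSRs")
    return ((ssr_r - ssr_ur) / q) / (ssr_ur / df_ur)


n, k, q = 353, 5, 3      # sample size, regressors in (4.28), restrictions
df_ur = n - k - 1        # 347 denominator degrees of freedom

f_baseball = f_stat_ssr(ssr_r=198.311, ssr_ur=183.186, q=q, df_ur=df_ur)
print(round(f_baseball, 2))  # 9.55
```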

The outcome of the joint test may seem surprising in light of the insignificant t statistics for
the three variables. What is happening is that the two variables hrunsyr and rbisyr are highly correlated,
and this multicollinearity makes it difficult to uncover the partial effect of each variable; this
is reflected in the individual t statistics. The F statistic tests whether these variables (including bavg)
are jointly significant, and multicollinearity between hrunsyr and rbisyr is much less relevant for testing
this hypothesis. In Computer Exercise C5, you are asked to reestimate the model while dropping
rbisyr, in which case hrunsyr becomes very significant. The same is true for rbisyr when hrunsyr is
dropped from the model. 

The F statistic is often useful for testing exclusion of a group of variables when the variables 
in the group are highly correlated. For example, suppose we want to test whether firm performance 
affects the salaries of chief executive officers. There are many ways to measure firm performance, and 
it probably would not be clear ahead of time which measures would be most important. Because mea- 
sures of firm performance are likely to be highly correlated, hoping to find individually significant 
measures might be asking too much due to multicollinearity. But an F test can be used to determine 
whether, as a group, the firm performance variables affect salary. 


4-5b Relationship between F and t Statistics


We have seen in this section how the F statistic can be used to test whether a group of variables 
should be included in a model. What happens if we apply the F statistic to the case of testing significance
of a single independent variable? This case is certainly not ruled out by the previous development.
For example, we can take the null to be $H_0$: $\beta_k = 0$ and q = 1 (to test the single exclusion
restriction that $x_k$ can be excluded from the model). From Section 4-2, we know that the t statistic
on $\beta_k$ can be used to test this hypothesis. The question, then, is: do we have two separate ways of
testing hypotheses about a single coefficient? The answer is no. It can be shown that the F statistic
for testing exclusion of a single variable is equal to the square of the corresponding t statistic.
Because $t^2_{n-k-1}$ has an $F_{1,\,n-k-1}$ distribution, the two approaches lead to exactly the same outcome,
provided that the alternative is two-sided. The t statistic is more flexible for testing a single hypothesis
because it can be directly used to test against one-sided alternatives. Because t statistics are
also easier to obtain than F statistics, there is really no reason to use an F statistic to test hypotheses 
about a single parameter. 
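
The equality between F and the squared t statistic can be verified numerically. The sketch below uses the simplest possible case, a regression with one explanatory variable (so k = 1 and q = 1), and a small data set that is made up purely for illustration. With a single regressor, the restricted model contains only an intercept, so its fitted value is the sample mean of y.

```python
import math

# Hypothetical data, invented for this illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.1, 2.9, 3.2, 4.8, 5.1, 5.7, 7.4, 7.9]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Unrestricted model: y = b0 + b1*x + u, estimated by OLS.
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar
ssr_ur = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
df_ur = n - 2  # n - k - 1 with k = 1

# Restricted model under H0: b1 = 0, so y = b0 + u and the fit is the mean of y.
ssr_r = sum((yi - y_bar) ** 2 for yi in y)

sigma2_hat = ssr_ur / df_ur
t_stat = b1 / math.sqrt(sigma2_hat / sxx)      # usual t statistic for b1
f_stat = ((ssr_r - ssr_ur) / 1) / sigma2_hat   # F statistic with q = 1

# The two numbers agree up to floating-point rounding.
print(t_stat ** 2, f_stat)
```

The same equality holds with more regressors, although verifying it then requires a multiple regression routine rather than the simple-regression formulas used here.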

We have already seen in the salary regressions for major league baseball players that two (or 
more) variables that each have insignificant t statistics can be jointly very significant. It is also possible
that, in a group of several explanatory variables, one variable has a significant t statistic but the
group of variables is jointly insignificant at the usual significance levels. What should we make of
this kind of outcome? For concreteness, suppose that in a model with many explanatory variables we
cannot reject the null hypothesis that $\beta_1$, $\beta_2$, $\beta_3$, $\beta_4$, and $\beta_5$ are all equal to zero at the 5% level, yet
the t statistic for $\hat{\beta}_1$ is significant at the 5% level. Logically, we cannot have $\beta_1 \neq 0$ but also have $\beta_1$,
$\beta_2$, $\beta_3$, $\beta_4$, and $\beta_5$ all equal to zero! But as a matter of testing, it is possible that we can group a bunch
of insignificant variables with a significant variable and conclude that the entire set of variables is
jointly insignificant. (Such possible conflicts between a t test and a joint F test give another example
of why we should not “accept” null hypotheses; we should only fail to reject them.) The F statistic is
intended to detect whether a set of coefficients is different from zero, but it is never the best test for
determining whether a single coefficient is different from zero. The t test is best suited for testing a
single hypothesis. (In statistical terms, an F statistic for joint restrictions including $\beta_1 = 0$ will have
less power for detecting $\beta_1 \neq 0$ than the usual t statistic. See Section C-6 in Math Refresher C for a
discussion of the power of a test.) 

Unfortunately, the fact that we can sometimes hide a statistically significant variable along with 
some insignificant variables could lead to abuse if regression results are not carefully reported. For 
example, suppose that, in a study of the determinants of loan-acceptance rates at the city level, $x_1$
is the fraction of black households in the city. Suppose that the variables $x_2$, $x_3$, $x_4$, and $x_5$ are the
fractions of households headed by different age groups. In explaining loan rates, we would include 
measures of income, wealth, credit ratings, and so on. Suppose that age of household head has no 
effect on loan approval rates, once other variables are controlled for. Even if race has a margin- 
ally significant effect, it is possible that the race and age variables could be jointly insignificant. 
Someone wanting to conclude that race is not a factor could simply report something like “Race 
and age variables were added to the equation, but they were jointly insignificant at the 5% level.”
Hopefully, peer review prevents these kinds of misleading conclusions, but you should be aware that
such outcomes are possible. 

Often, when a variable is very statistically significant and it is tested jointly with another set of 
variables, the set will be jointly significant. In such cases, there is no logical inconsistency in rejecting 
both null hypotheses. 


4-5c The R-Squared Form of the F Statistic 


For testing exclusion restrictions, it is often more convenient to have a form of the F statistic that can 
be computed using the R-squareds from the restricted and unrestricted models. One reason for this is 
that the R-squared is always between zero and one, whereas the SSRs can be very large depending on 
the unit of measurement of y, making the calculation based on the SSRs tedious. Using the fact that 
$SSR_r = SST(1 - R_r^2)$ and $SSR_{ur} = SST(1 - R_{ur}^2)$, we can substitute into (4.37) to obtain


$$F = \frac{(R_{ur}^2 - R_r^2)/q}{(1 - R_{ur}^2)/(n - k - 1)} = \frac{(R_{ur}^2 - R_r^2)/q}{(1 - R_{ur}^2)/df_{ur}}$$ [4.41]

(note that the SST terms cancel everywhere). This is called the R-squared form of the F statistic. 
[At this point, you should be cautioned that although equation (4.41) is very convenient for testing 
exclusion restrictions, it cannot be applied for testing all linear restrictions. As we will see when we 
discuss testing general linear restrictions, the sum of squared residuals form of the F statistic is sometimes
needed.]
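
For readers who want to see the cancellation explicitly, the substitution can be written out in one line. This only restates the step described above, under the same condition that both regressions share the same SST (the same dependent variable and the same observations):

$$F = \frac{\left[SST(1 - R_r^2) - SST(1 - R_{ur}^2)\right]/q}{SST(1 - R_{ur}^2)/(n - k - 1)} = \frac{SST\,(R_{ur}^2 - R_r^2)/q}{SST\,(1 - R_{ur}^2)/(n - k - 1)} = \frac{(R_{ur}^2 - R_r^2)/q}{(1 - R_{ur}^2)/(n - k - 1)}.$$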

Because the R-squared is reported with almost all regressions (whereas the SSR is not), it is 
easy to use the R-squareds from the unrestricted and restricted models to test for exclusion of some 
variables. Particular attention should be paid to the order of the R-squareds in the numerator: the unrestricted
R-squared comes first [contrast this with the SSRs in (4.37)]. Because $R_{ur}^2 > R_r^2$, this shows
again that F will always be positive.

In using the R-squared form of the test for excluding a set of variables, it is important to not 
square the R-squared before plugging it into formula (4.41); the squaring has already been done. 
All regressions report $R^2$, and these numbers are plugged directly into (4.41). For the baseball salary
example, we can use (4.41) to obtain the F statistic:


$$F = \frac{(.6278 - .5971)}{(1 - .6278)} \cdot \frac{347}{3} \approx 9.54,$$


which is very close to what we obtained before. (The difference is due to rounding error.) 
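
A short sketch of the R-squared form follows. The function name `f_stat_rsq` is our own choice, and the inputs are the R-squareds reported in (4.31) and (4.33). Note that the R-squareds are entered as reported, without squaring them again, and that the unrestricted R-squared comes first.

```python
def f_stat_rsq(r2_ur, r2_r, q, df_ur):
    """R-squared form of the F statistic, as in (4.41)."""
    if r2_ur < r2_r:
        # Dropping variables can never raise the R-squared.
        raise ValueError("R2_ur is smaller than R2_r; check the order of the R-squareds")
    return ((r2_ur - r2_r) / q) / ((1.0 - r2_ur) / df_ur)


f_baseball = f_stat_rsq(r2_ur=0.6278, r2_r=0.5971, q=3, df_ur=347)
print(round(f_baseball, 2))  # 9.54
```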


EXAMPLE 4.9 Parents’ Education in a Birth Weight Equation 


As another example of computing an F statistic, consider the following model to explain child birth 
weight in terms of various factors: 


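$$\mathit{bwght} = \beta_0 + \beta_1\,\mathit{cigs} + \beta_2\,\mathit{parity} + \beta_3\,\mathit{faminc} + \beta_4\,\mathit{motheduc} + \beta_5\,\mathit{fatheduc} + u,$$ [4.42]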


where 


bwght = birth weight, in pounds. 
cigs = average number of cigarettes the mother smoked per day during pregnancy. 
parity = the birth order of this child. 
faminc = annual family income. 
motheduc = years of schooling for the mother. 
fatheduc = years of schooling for the father.

Let us test the null hypothesis that, after controlling for cigs, parity, and faminc, parents’ education
has no effect on birth weight. This is stated as $H_0$: $\beta_4 = 0$, $\beta_5 = 0$, and so there are q = 2 exclusion
restrictions to be tested. There are k + 1 = 6 parameters in the unrestricted model (4.42); so the df 
in the unrestricted model is n — 6, where n is the sample size. 

We will test this hypothesis using the data in BWGHT. This data set contains information on 
1,388 births, but we must be careful in counting the observations used in testing the null hypothesis. 
It turns out that information on at least one of the variables motheduc and fatheduc is missing for 197 
births in the sample; these observations cannot be included when estimating the unrestricted model. 
Thus, we really have n = 1,191 observations, and so there are 1,191 — 6 = 1,185 df in the unre- 
stricted model. We must be sure to use these same 1,191 observations when estimating the restricted 
model (not the full 1,388 observations that are available). Generally, when estimating the restricted 
model to compute an F test, we must use the same observations to estimate the unrestricted model; 
otherwise, the test is not valid. When there are no missing data, this will not be an issue. 

The numerator df is 2, and the denominator df is 1,185; from Table G.3, the 5% critical value 
is c = 3.0. Rather than report the complete results, for brevity, we present only the R-squareds. 
The R-squared for the full model turns out to be $R_{ur}^2 = .0387$. When motheduc and fatheduc
are dropped from the regression, the R-squared falls to $R_r^2 = .0364$. Thus, the F statistic is
F = [(.0387 − .0364)/(1 − .0387)](1,185/2) = 1.42; because this is well below the 5% critical
value, we fail to reject $H_0$. In other words, motheduc and fatheduc are jointly insignificant in the birth
weight equation. Most statistical packages have built-in commands for testing multiple hypotheses 
after OLS estimation, and so one need not worry about making the mistake of running the two regres- 
sions on different data sets. Typically, the commands are applied after estimation of the unrestricted 
model, which means the smaller subset of data is used whenever there are missing values on some 
variables. Formulas for computing the F statistic using matrix algebra—see Advanced Treatment E— 
do not require estimation of the restricted model. 
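
The point about using the same observations for both regressions can be illustrated with a few lines of code. The records below are hypothetical and serve only to show the bookkeeping; they are not taken from BWGHT. The idea is to build the estimation sample from the variables in the unrestricted model and then reuse exactly that sample for the restricted model.

```python
# Hypothetical records; None marks a missing value.
births = [
    {"bwght": 7.2, "cigs": 0, "parity": 1, "faminc": 27.5, "motheduc": 12, "fatheduc": 12},
    {"bwght": 6.1, "cigs": 10, "parity": 2, "faminc": 13.5, "motheduc": 12, "fatheduc": None},
    {"bwght": 8.0, "cigs": 0, "parity": 1, "faminc": 42.5, "motheduc": None, "fatheduc": 16},
    {"bwght": 7.5, "cigs": 0, "parity": 3, "faminc": 65.0, "motheduc": 16, "fatheduc": 16},
]

unrestricted_vars = ["bwght", "cigs", "parity", "faminc", "motheduc", "fatheduc"]
restricted_vars = ["bwght", "cigs", "parity", "faminc"]


def complete_cases(rows, variables):
    """Keep only rows with no missing value in any of the listed variables."""
    return [row for row in rows if all(row[v] is not None for v in variables)]


# Build the estimation sample from the UNRESTRICTED variable list ...
sample = complete_cases(births, unrestricted_vars)
# ... and reuse exactly this sample for the restricted regression.
restricted_sample = [{v: row[v] for v in restricted_vars} for row in sample]

# What to avoid: filtering on the restricted variables alone keeps extra rows.
too_many = complete_cases(births, restricted_vars)

print(len(sample), len(restricted_sample), len(too_many))  # 2 2 4
```

In the birth weight example, the analogous counts are 1,191 observations for both regressions, rather than 1,191 for one and 1,388 for the other.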


4-5d Computing p-Values for F Tests 


For reporting the outcomes of F tests, p-values are especially useful. Because the F distribution depends
on the numerator and denominator df, it is difficult to get a feel for how strong or weak the evidence is
against the null hypothesis simply by looking at the value of the F statistic and one or two critical values.
In the F testing context, the p-value is defined as


$$p\text{-value} = P(\mathcal{F} > F),$$ [4.43]


where, for emphasis, we let $\mathcal{F}$ denote an F random variable with $(q, n - k - 1)$ degrees of freedom, and
F is the actual value of the test statistic. The p-value still has the same interpretation as it did for t statistics:
it is the probability of observing a value of F at least as large as we did, given that the null hypothesis
is true. A small p-value is evidence against $H_0$. For example, p-value = .016 means that the chance
of observing a value of F as large as we did when the null hypothesis was true is only 1.6%; we usually
reject $H_0$ in such cases. If the p-value = .314, then the chance of observing a value of the F statistic as
large as we did under the null hypothesis is 31.4%. Most would find this to be pretty weak evidence
against $H_0$.


GOING FURTHER 4.5

The data in ATTEND were used to estimate the two equations

$$\widehat{\mathit{atndrte}} = \underset{(2.87)}{47.13} + \underset{(1.09)}{13.87}\,\mathit{priGPA}$$
$$n = 680,\quad R^2 = .183$$

and

$$\widehat{\mathit{atndrte}} = \underset{(3.88)}{75.70} + \underset{(1.08)}{17.26}\,\mathit{priGPA} - \underset{(?)}{1.72}\,\mathit{ACT}$$
$$n = 680,\quad R^2 = .291,$$

where, as always, standard errors are in parentheses; the standard error for ACT is missing in the
second equation. What is the t statistic for the coefficient on ACT? (Hint: First compute the F statistic
for significance of ACT.)


As with t testing, once the p-value has been computed, the F test can be carried out at any significance
level. For example, if the p-value = .024, we reject $H_0$ at the 5% significance level but not at
the 1% level. 

The p-value for the F test in Example 4.9 is .238, and so the null hypothesis that $\beta_{motheduc}$ and
$\beta_{fatheduc}$ are both zero is not rejected at even the 20% significance level.
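
In practice the p-value comes from statistical software, but it can also be computed directly from the definition in (4.43). The sketch below is one way to do so with only the Python standard library. It uses the standard identity that $P(\mathcal{F} > F)$ for an $F_{q,\,df}$ random variable equals the regularized incomplete beta function evaluated at $df/(df + qF)$ with parameters $df/2$ and $q/2$, and it evaluates that function with a textbook continued-fraction method. All function names are our own, and the code is meant as an illustration rather than a replacement for a tested library routine.

```python
import math


def _beta_continued_fraction(a, b, x, max_iter=300, eps=3e-14):
    """Continued fraction for the incomplete beta function (modified Lentz method)."""
    tiny = 1e-300
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    if abs(d) < tiny:
        d = tiny
    d = 1.0 / d
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        # Even step of the recurrence.
        aa = m * (b - m) * x / ((qam + m2) * (a + m2))
        d = 1.0 + aa * d
        if abs(d) < tiny:
            d = tiny
        c = 1.0 + aa / c
        if abs(c) < tiny:
            c = tiny
        d = 1.0 / d
        h *= d * c
        # Odd step of the recurrence.
        aa = -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))
        d = 1.0 + aa * d
        if abs(d) < tiny:
            d = tiny
        c = 1.0 + aa / c
        if abs(c) < tiny:
            c = tiny
        d = 1.0 / d
        delta = d * c
        h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def regularized_incomplete_beta(a, b, x):
    """I_x(a, b), the regularized incomplete beta function."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    log_front = (math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                 + a * math.log(x) + b * math.log(1.0 - x))
    front = math.exp(log_front)
    # Use the symmetry relation where the continued fraction converges faster.
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _beta_continued_fraction(a, b, x) / a
    return 1.0 - front * _beta_continued_fraction(b, a, 1.0 - x) / b


def f_pvalue(f_stat, q, df_denom):
    """P(F_{q, df_denom} > f_stat), the p-value in (4.43)."""
    if f_stat <= 0.0:
        return 1.0
    x = df_denom / (df_denom + q * f_stat)
    return regularized_incomplete_beta(df_denom / 2.0, q / 2.0, x)


# Example 4.9: the rounded F statistic 1.42 with (2, 1185) degrees of freedom.
p_example = f_pvalue(1.42, q=2, df_denom=1185)
print(round(p_example, 3))
```

With the rounded value F = 1.42, this calculation gives a p-value of roughly .24, slightly above the .238 reported in the text; the small gap presumably arises because the reported p-value is based on an F statistic computed without rounding the R-squareds.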

Many econometrics packages have a built-in feature for testing multiple exclusion restrictions.
These packages have several advantages over calculating the statistics by hand: we are less
likely to make a mistake, p-values are computed automatically, and the problem of missing data, as in
Example 4.9, is handled without any additional work on our part. 


4-5e The F Statistic for Overall Significance of a Regression 


A special set of exclusion restrictions is routinely tested by most regression packages. These restric- 
tions have the same interpretation, regardless of the model. In the model with k independent variables, 
we can write the null hypothesis as 


$H_0$: $x_1, x_2, \ldots, x_k$ do not help to explain y.


This null hypothesis is, in a way, very pessimistic. It states that none of the explanatory variables has 
an effect on y. Stated in terms of the parameters, the null is that all slope parameters are zero: 


$$H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0,$$ [4.44]


and the alternative is that at least one of the $\beta_j$ is different from zero. Another useful way of stating the
null is that $H_0$: $E(y|x_1, x_2, \ldots, x_k) = E(y)$, so that knowing the values of $x_1, x_2, \ldots, x_k$ does not affect
the expected value of y.

There are k restrictions in (4.44), and when we impose them, we get the restricted model


$$y = \beta_0 + u;$$ [4.45]


all independent variables have been dropped from the equation. Now, the R-squared from estimating
(4.45) is zero; none of the variation in y is being explained because there are no explanatory variables.
Therefore, the F statistic for testing (4.44) can be written as


$$\frac{R^2/k}{(1 - R^2)/(n - k - 1)},$$ [4.46]


where $R^2$ is just the usual R-squared from the regression of y on $x_1, x_2, \ldots, x_k$.

Most regression packages report the F statistic in (4.46) automatically, which makes it tempting 
to use this statistic to test general exclusion restrictions. You must avoid this temptation. The F statis- 
tic in (4.41) is used for general exclusion restrictions; it depends on the R-squareds from the restricted 
and unrestricted models. The special form of (4.46) is valid only for testing joint exclusion of all inde- 
pendent variables. This is sometimes called determining the overall significance of the regression. 

If we fail to reject (4.44), then there is no evidence that any of the independent variables help to 
explain y. This usually means that we must look for other variables to explain y. For Example 4.9, the 
F statistic for testing (4.44) is about 9.55 with k = 5 and n − k − 1 = 1,185 df. The p-value is zero to
four places after the decimal point, so that (4.44) is rejected very strongly. Thus, we conclude that the 
variables in the bwght equation do explain some variation in bwght. The amount explained is not large: 
only 3.87%. But the seemingly small R-squared results in a highly significant F statistic. That is why we 
must compute the F statistic to test for joint significance and not just look at the size of the R-squared. 
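
A small sketch makes the role of the sample size concrete. The function name `f_stat_overall` is our own; the first call uses the values from Example 4.9, and the second call is a hypothetical comparison that keeps the same R-squared but assumes only 100 observations.

```python
def f_stat_overall(r2, k, n):
    """F statistic in (4.46) for H0: all k slope parameters are zero."""
    df_denom = n - k - 1
    return (r2 / k) / ((1.0 - r2) / df_denom)


# Example 4.9 values from the text: R-squared = .0387, k = 5, n = 1,191.
f_bwght = f_stat_overall(r2=0.0387, k=5, n=1191)

# Hypothetical comparison: the same R-squared with only n = 100 observations.
f_small_sample = f_stat_overall(r2=0.0387, k=5, n=100)

print(round(f_bwght, 2), round(f_small_sample, 2))
```

Using the rounded R-squared, the first value comes out close to the 9.55 reported above, while the hypothetical small-sample value is below one. The same R-squared can therefore be highly significant or not significant at all, depending on n.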

Occasionally, the F statistic for the hypothesis that all independent variables are jointly insignifi- 
cant is the focus of a study. Problem 10 asks you to use stock return data to test whether stock returns 
over a four-year horizon are predictable based on information known only at the beginning of the 
period. Under the efficient markets hypothesis, the returns should not be predictable; the null hypothesis
is precisely (4.44).


4-5f Testing General Linear Restrictions


Testing exclusion restrictions is by far the most important application of F statistics. Sometimes, how- 
ever, the restrictions implied by a theory are more complicated than just excluding some independent 
variables. It is still straightforward to use the F statistic for testing. 

As an example, consider the following equation: 


$$\log(\mathit{price}) = \beta_0 + \beta_1 \log(\mathit{assess}) + \beta_2 \log(\mathit{lotsize}) + \beta_3 \log(\mathit{sqrft}) + \beta_4\,\mathit{bdrms} + u,$$ [4.47]


where 


price = house price. 
assess = the assessed housing value (before the house was sold).
lotsize = size of the lot, in square feet. 

sqrft = square footage. 
bdrms = number of bedrooms. 


Now, suppose we would like to test whether the assessed housing price is a rational valuation. If 
this is the case, then a 1% change in assess should be associated with a 1% change in price; that is, 
$\beta_1 = 1$. In addition, lotsize, sqrft, and bdrms should not help to explain log(price), once the assessed
value has been controlled for. Together, these hypotheses can be stated as 


$$H_0: \beta_1 = 1,\ \beta_2 = 0,\ \beta_3 = 0,\ \beta_4 = 0.$$ [4.48]


Four restrictions have to be tested; three are exclusion restrictions, but $\beta_1 = 1$ is not. How can we test
this hypothesis using the F statistic? 

As in the exclusion restriction case, we estimate the unrestricted model, (4.47) in this case, and 
then impose the restrictions in (4.48) to obtain the restricted model. It is the second step that can be a 
little tricky. But all we do is plug in the restrictions. If we write (4.47) as 


$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + u,$$ [4.49]


then the restricted model is $y = \beta_0 + x_1 + u$. Now, to impose the restriction that the coefficient on $x_1$
is unity, we must estimate the following model:


$$y - x_1 = \beta_0 + u.$$ [4.50]


This is just a model with an intercept ($\beta_0$) but with a different dependent variable than in (4.49).
The procedure for computing the F statistic is the same: estimate (4.50), obtain the SSR ($SSR_r$),
and use this with the unrestricted SSR from (4.49) in the F statistic (4.37). We are testing
q = 4 restrictions, and there are n − 5 df in the unrestricted model. The F statistic is simply
$[(SSR_r - SSR_{ur})/SSR_{ur}][(n - 5)/4]$.

Before illustrating this test using a data set, we must emphasize one point: we cannot use the 
R-squared form of the F statistic for this example because the dependent variable in (4.50) is different 
from the one in (4.49). This means the total sum of squares from the two regressions will be different, 
and (4.41) is no longer equivalent to (4.37). As a general rule, the SSR form of the F statistic should 
be used if a different dependent variable is needed in running the restricted regression. 

The estimated unrestricted model using the data in HPRICE1 is 


$$\widehat{\log(\mathit{price})} = \underset{(.570)}{.264} + \underset{(.151)}{1.043}\,\log(\mathit{assess}) + \underset{(.0386)}{.0074}\,\log(\mathit{lotsize}) - \underset{(.1384)}{.1032}\,\log(\mathit{sqrft}) + \underset{(.0221)}{.0338}\,\mathit{bdrms}$$
$$n = 88,\quad SSR = 1.822,\quad R^2 = .773.$$

If we use separate t statistics to test each hypothesis in (4.48), we fail to reject each one. But rationality of
the assessment is a joint hypothesis, so we should test the restrictions jointly. The SSR from the restricted
model turns out to be $SSR_r = 1.880$, and so the F statistic is [(1.880 − 1.822)/1.822](83/4) = .661.
The 5% critical value in an F distribution with (4,83) df is about 2.50, and so we fail to reject $H_0$. There
is essentially no evidence against the hypothesis that the assessed values are rational. 
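
The two steps of this test can be sketched in code. The first part shows how the restricted model (4.50) is estimated: subtract $x_1$ from y and regress the difference on an intercept only, so that the OLS intercept is the sample mean and $SSR_r$ is the sum of squared deviations around it. The data in that part are made up for illustration and are not from HPRICE1. The second part plugs the SSRs reported in the text into the SSR form of the F statistic; the function name `f_stat_ssr` is our own.

```python
# Part 1: imposing the restrictions, using hypothetical data.
log_price = [12.10, 12.45, 11.98, 12.80, 12.33, 12.60]
log_assess = [12.05, 12.50, 12.00, 12.70, 12.30, 12.65]

# Under H0, beta1 = 1 and the other slopes are zero, so the dependent
# variable of the restricted regression is y - x1.
diff = [y - x1 for y, x1 in zip(log_price, log_assess)]
b0_hat = sum(diff) / len(diff)                      # OLS intercept = sample mean
ssr_r_demo = sum((d - b0_hat) ** 2 for d in diff)   # restricted SSR for the demo data


# Part 2: the F statistic from the SSRs reported in the text for HPRICE1.
def f_stat_ssr(ssr_r, ssr_ur, q, df_ur):
    """SSR form of the F statistic, as in (4.37)."""
    return ((ssr_r - ssr_ur) / q) / (ssr_ur / df_ur)


n, q = 88, 4
df_ur = n - 5                                       # five parameters in (4.47)
f_hprice = f_stat_ssr(ssr_r=1.880, ssr_ur=1.822, q=q, df_ur=df_ur)
print(round(f_hprice, 3))  # 0.661
```

The R-squared form is deliberately not used here, because the restricted regression has a different dependent variable and therefore a different total sum of squares.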


4-6 Reporting Regression Results 


We end this chapter by providing a few guidelines on how to report multiple regression results for 
relatively complicated empirical projects. This should help you to read published works in the applied 
social sciences, while also preparing you to write your own empirical papers. We will expand on this 
topic in the remainder of the text by reporting results from various examples, but many of the key 
points can be made now. 

Naturally, the estimated OLS coefficients should always be reported. For the key variables in an 
analysis, you should interpret the estimated coefficients (which often requires knowing the units of 
measurement of the variables). For example, is an estimate an elasticity, or does it have some other 
interpretation that needs explanation? The economic or practical importance of the estimates of the 
key variables should be discussed. 

The standard errors should always be included along with the estimated coefficients. Some 
authors prefer to report the t statistics rather than the standard errors (and sometimes just the absolute
value of the t statistics). Although nothing is really wrong with this, there is some preference for
reporting standard errors. First, it forces us to think carefully about the null hypothesis being tested; 
the null is not always that the population parameter is zero. Second, having standard errors makes it 
easier to compute confidence intervals. 

The R-squared from the regression should always be included. We have seen that, in addition 
to providing a goodness-of-fit measure, it makes calculation of F statistics for exclusion restrictions 
simple. Reporting the sum of squared residuals and the standard error of the regression is sometimes 
a good idea, but it is not crucial. The number of observations used in estimating any equation should 
appear near the estimated equation. 

If only a couple of models are being estimated, the results can be summarized in equation form, 
as we have done up to this point. However, in many papers, several equations are estimated with many 
different sets of independent variables. We may estimate the same equation for different groups of 
people, or even have equations explaining different dependent variables. In such cases, it is better to 
summarize the results in one or more tables. The dependent variable should be indicated clearly in the 
table, and the independent variables should be listed in the first column. Standard errors (or t statis- 
tics) can be put in parentheses below the estimates. 
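
As a rough illustration of this layout, the following Python sketch prints a one-column results table with each standard error in parentheses below its estimate. The helper `format_results` and its parameter names are hypothetical. The numbers are those of the unrestricted HPRICE1 equation reported earlier in this section, stored as strings so that the reported rounding is preserved.

```python
def format_results(dep_var, coefs, n_obs, r_squared):
    """Build a plain-text results column: estimate, then (standard error) below it."""
    labels = list(coefs) + ["Observations", "R-squared"]
    width = max(len(label) for label in labels)
    lines = [f"Dependent Variable: {dep_var}", ""]
    for name, (estimate, std_err) in coefs.items():
        se_text = f"({std_err})"
        lines.append(f"{name:<{width}}  {estimate:>8}")
        lines.append(f"{' ' * width}  {se_text:>8}")
    lines.append(f"{'Observations':<{width}}  {n_obs:>8}")
    lines.append(f"{'R-squared':<{width}}  {r_squared:>8}")
    return "\n".join(lines)


# Estimates and standard errors as reported for the HPRICE1 equation.
coefs = {
    "intercept": (".264", ".570"),
    "log(assess)": ("1.043", ".151"),
    "log(lotsize)": (".0074", ".0386"),
    "log(sqrft)": ("-.1032", ".1384"),
    "bdrms": (".0338", ".0221"),
}

table = format_results("log(price)", coefs, n_obs=88, r_squared=".773")
print(table)
```

Adding further columns for other specifications would follow the same pattern, with one estimate and standard error pair per variable per column.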


EXAMPLE 4.10 Salary-Pension Tradeoff for Teachers 


Let totcomp denote average total annual compensation for a teacher, including salary and all fringe 
benefits (pension, health insurance, and so on). Extending the standard wage equation, total compen- 
sation should be a function of productivity and perhaps other characteristics. As is standard, we use 
logarithmic form: 


log(totcomp) = f(productivity characteristics, other factors), 


where $f(\cdot)$ is some function (unspecified for now). Write


$$\mathit{totcomp} = \mathit{salary} + \mathit{benefits} = \mathit{salary}\left(1 + \frac{\mathit{benefits}}{\mathit{salary}}\right).$$

This equation shows that total compensation is the product of two terms: salary and 1 + b/s,
where b/s is shorthand for the “benefits to salary ratio.” Taking the log of this equation gives
log(totcomp) = log(salary) + log(1 + b/s). Now, for “small” b/s, log(1 + b/s) ≈ b/s; we will
use this approximation. This leads to the econometric model


$$\log(\mathit{salary}) = \beta_0 + \beta_1 (b/s) + \text{other factors}.$$


Testing the salary-benefits tradeoff then is the same as a test of $H_0$: $\beta_1 = -1$ against $H_1$: $\beta_1 \neq -1$.

We use the data in MEAP93 to test this hypothesis. These data are averaged at the school level, 
and we do not observe very many other factors that could affect total compensation. We will include 
controls for size of the school (enroll), staff per thousand students (staff), and measures such as the 
school dropout and graduation rates. The average b/s in the sample is about .205, and the largest value 
is .450. 
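
To get a feel for how good the approximation log(1 + b/s) ≈ b/s is at the sample values just quoted, the following Python sketch compares the two sides. The function name `log_approx_error` is an illustrative assumption; the two ratios are the sample average and maximum reported above.

```python
import math


def log_approx_error(ratio):
    """Error from replacing log(1 + ratio) with ratio itself."""
    return ratio - math.log1p(ratio)


# Sample average and largest value of b/s reported in the text.
for label, bs in [("sample average", 0.205), ("sample maximum", 0.450)]:
    exact = math.log1p(bs)
    error = log_approx_error(bs)
    print(f"{label}: b/s = {bs:.3f}, log(1 + b/s) = {exact:.3f}, error = {error:.3f}")
```

The error grows with b/s, so the approximation is noticeably looser for schools near the top of the benefits-salary range than at the sample average.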

The estimated equations are given in Table 4.1, where standard errors are given in parentheses 
below the coefficient estimates. The key variable is b/s, the benefits-salary ratio. 

From the first column in Table 4.1, we see that, without controlling for any other factors, the
OLS coefficient for b/s is -.825. The t statistic for testing the null hypothesis $H_0$: $\beta_1 = -1$ is
t = (-.825 + 1)/.200 = .875, and so the simple regression fails to reject $H_0$. After adding controls for
school size and staff size (which roughly captures the number of students taught by each teacher), the
estimate of the b/s coefficient becomes -.605. Now, the test of $\beta_1 = -1$ gives a t statistic of about 2.39;
thus, $H_0$ is rejected at the 5% level against a two-sided alternative. The variables log(enroll) and log(staff)
are very statistically significant.


GOING FURTHER 4.6

How does adding droprate and gradrate affect the estimate of the salary-benefits tradeoff? Are these
variables jointly significant at the 5% level? What about the 10% level?

TABLE 4.1 Testing the Salary-Benefits Tradeoff


Dependent Variable: log(salary)

Independent Variables      (1)        (2)        (3)
b/s                       -.825      -.605      -.589
                          (.200)     (.165)     (.165)
log(enroll)                 --       .0874      .0881
                                     (.0073)    (.0073)
log(staff)                  --       -.222      -.218
                                     (.050)     (.050)
droprate                    --         --       -.00028
                                                (.00161)
gradrate                    --         --       .00097
                                                (.00066)
intercept                 10.523     10.884     10.738
                          (0.042)    (0.252)    (0.258)
Observations              408        408        408
R-squared                 .040       .353       .361
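
The t statistics for $H_0$: $\beta_1 = -1$ can be recomputed from the b/s row of Table 4.1. The Python sketch below is only an illustration: the helper `t_stat` and the dictionary `columns` are assumed names, and the inputs are the rounded estimates and standard errors shown in the table.

```python
def t_stat(estimate, std_err, null_value=0.0):
    """t statistic for H0: parameter = null_value."""
    return (estimate - null_value) / std_err


# b/s estimates and standard errors from columns (1)-(3) of Table 4.1.
columns = {1: (-0.825, 0.200), 2: (-0.605, 0.165), 3: (-0.589, 0.165)}

t_values = {
    col: t_stat(est, se, null_value=-1.0) for col, (est, se) in columns.items()
}

for col, t in t_values.items():
    print(f"column ({col}): t = {t:.3f}")
```

With roughly 400 degrees of freedom, the two-sided 5% critical value is close to 1.96, so the comparison comes down to whether each t statistic exceeds that value in absolute terms.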


CHAPTER 4 Multiple Regression Analysis: Inference 151 


4-7 Revisiting Causal Effects and Policy Analysis 


In Section 3-7e we showed how multiple regression can be used to obtain unbiased estimators of
causal, or treatment, effects in the context of policy interventions, provided we have controls sufficient
to ensure that participation assignment is unconfounded. In particular, with a constant treatment
effect, $\tau$, we derived


$$E(y|w, \mathbf{x}) = \alpha + \tau w + \mathbf{x}\boldsymbol{\gamma} = \alpha + \tau w + \gamma_1 x_1 + \cdots + \gamma_k x_k,$$


where y is the outcome or response, w is the binary policy (treatment) variable, and the $x_j$ are the
controls that account for nonrandom assignment. We know that the OLS estimator of $\tau$ is unbiased
because MLR.1 and MLR.4 hold (and we have random sampling from the population). If we
add MLR.5 and MLR.6, we can perform exact inference on $\tau$. For example, the null hypothesis
of no policy effect is $H_0$: $\tau = 0$, and we can test this hypothesis, against a one-sided or two-sided
alternative, using a standard t statistic. Regardless of the magnitude of the estimate $\hat{\tau}$, most researchers
and administrators will not be convinced that an intervention or policy is effective unless $\hat{\tau}$ is
statistically different from zero (and with the expected sign) at a sufficiently small significance level.
As in any context, it is important to discuss the sign and magnitude of $\hat{\tau}$ in addition to its statistical
significance. Probably of more interest is to obtain a 95% confidence interval for $\tau$, which gives us a
plausible range of values for the population treatment effect.

We can also test hypotheses about the $\gamma_j$, but, in a policy environment, we are rarely concerned
about the statistical significance of the $x_j$ except perhaps as a logical check on the regression results. For
example, we should expect past labor market earnings to positively predict current labor market earnings. 

We now revisit Example 3.7, which contains the estimated effect of a job training program using 
JTRAIN98. 


Evaluating a Job Training Program 


We reproduce the simple and multiple regression estimates and now put the standard errors below the 
coefficients. Recall that the outcome variable, earn98, is measured in thousands of dollars: 
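$$\widehat{\mathit{earn98}} = \underset{(0.28)}{10.61} - \underset{(0.48)}{2.05}\,\mathit{train}$$ [4.51]
$$n = 1{,}130,\ R^2 = 0.016$$


$$\widehat{\mathit{earn98}} = \underset{(1.15)}{4.67} + \underset{(0.44)}{2.41}\,\mathit{train} + \underset{(.019)}{.373}\,\mathit{earn96} + \underset{(.064)}{.363}\,\mathit{educ} - \underset{(.019)}{.181}\,\mathit{age} + \underset{(0.43)}{2.48}\,\mathit{married}$$ [4.52]
$$n = 1{,}130,\ R^2 = 0.405$$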


As discussed in Example 3.7, the change in the sign of the coefficient on train is striking when moving
from simple to multiple regression. Moreover, the t statistic in (4.51) is -2.05/0.48 ≈ -4.27,
which gives a very statistically significant and practically large negative effect of the program. By
contrast, the t statistic in (4.52) is about 5.47, which shows a strongly statistically significant and
positive effect. It is pretty clear that we prefer the multiple regression results for evaluating the job
training program. Of course, it could be that we have omitted some important controls in (4.52), but at
a minimum we know that we can account for some important differences across workers.
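
The t statistics quoted above, along with an approximate 95% confidence interval for the treatment effect, can be sketched in Python as follows. The helpers `t_stat` and `conf_int` are illustrative names, and the interval uses a standard normal critical value, which is an assumption that should be close to the exact t value given the large sample (n = 1,130). The inputs are the rounded estimates from (4.51) and (4.52), so results can differ from the text in the last digit.

```python
from statistics import NormalDist


def t_stat(estimate, std_err, null_value=0.0):
    """t statistic for H0: parameter = null_value."""
    return (estimate - null_value) / std_err


def conf_int(estimate, std_err, level=0.95):
    """Approximate confidence interval using a standard normal critical value."""
    crit = NormalDist().inv_cdf(0.5 + level / 2)
    return estimate - crit * std_err, estimate + crit * std_err


# Coefficient and standard error on train, as reported (rounded) in the text.
train_simple = (-2.05, 0.48)     # equation (4.51)
train_multiple = (2.41, 0.44)    # equation (4.52)

t_simple = t_stat(*train_simple)
t_multiple = t_stat(*train_multiple)
ci_multiple = conf_int(*train_multiple)

print(f"t in (4.51): {t_simple:.2f}")
print(f"t in (4.52): {t_multiple:.2f}")
print(f"approx. 95% CI for the effect in (4.52): "
      f"({ci_multiple[0]:.2f}, {ci_multiple[1]:.2f}) thousand dollars")
```

The interval gives a plausible range for the population treatment effect in thousands of dollars, which is usually more informative than the t statistic alone.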


Perhaps now is a good time to revisit the multicollinearity issue, which we raised in
Section 3-4a. Recall that collinearity arises only in the context of multiple regression, and so our
discussion is relevant only for equation (4.52). In this equation, it could be that two or more of the control
variables are highly correlated; or maybe not. The point is that we do not care. The reason we
include earn96, educ, age, and married is to control for differences in men that at least partly determine
participation in the job training program, hopefully leading to an unbiased estimator of the treatment
effect. We are not worried about how well we estimate the coefficients on the control variables, and
including highly correlated variables among the $x_j$ has nothing to do with obtaining a reliable estimate
of $\tau$. Computer Exercise C14 asks you to add a binary variable for whether the man was unemployed in
1996, which is strongly related to earnings in 1996: unem96 = 1 means earn96 = 0. And yet, correlation
between unem96 and earn96 is of essentially no concern. If we could observe earnings in 1995,
earn95, we would likely include it, too, even though it is likely to be highly correlated with earn96.


Summary 


In this chapter, we have covered the very important topic of statistical inference, which allows us to infer 
something about the population model from a random sample. We summarize the main points: 


1. Under the classical linear model assumptions MLR.1 through MLR.6, the OLS estimators are
normally distributed.

2. Under the CLM assumptions, the t statistics have t distributions under the null hypothesis.

3. We use t statistics to test hypotheses about a single parameter against one- or two-sided alternatives,
using one- or two-tailed tests, respectively. The most common null hypothesis is $H_0$: $\beta_j = 0$, but we
sometimes want to test other values of $\beta_j$ under $H_0$.

4. In classical hypothesis testing, we first choose a significance level, which, along with the df and
alternative hypothesis, determines the critical value against which we compare the t statistic. It is more
informative to compute the p-value for a t test (the smallest significance level for which the null
hypothesis is rejected) so that the hypothesis can be tested at any significance level.

5. Under the CLM assumptions, confidence intervals can be constructed for each $\beta_j$. These CIs can be
used to test any null hypothesis concerning $\beta_j$ against a two-sided alternative.

6. Single hypothesis tests concerning more than one $\beta_j$ can always be tested by rewriting the model to
contain the parameter of interest. Then, a standard t statistic can be used.

7. The F statistic is used to test multiple exclusion restrictions, and there are two equivalent forms of the
test. One is based on the SSRs from the restricted and unrestricted models. A more convenient form is
based on the R-squareds from the two models.

8. When computing an F statistic, the numerator df is the number of restrictions being tested, while the
denominator df is the degrees of freedom in the unrestricted model.

9. The alternative for F testing is two-sided. In the classical approach, we specify a significance level
which, along with the numerator df and the denominator df, determines the critical value. The null
hypothesis is rejected when the statistic, F, exceeds the critical value, c. Alternatively, we can compute
a p-value to summarize the evidence against $H_0$.

10. General multiple linear restrictions can be tested using the sum of squared residuals form of the
F statistic.

11. The F statistic for the overall significance of a regression tests the null hypothesis that all slope
parameters are zero, with the intercept unrestricted. Under $H_0$, the explanatory variables have no effect on
the expected value of y.

12. When data are missing on one or more explanatory variables, one must be careful when computing
F statistics “by hand,” that is, using either the sum of squared residuals or R-squareds from the two
regressions. Whenever possible it is best to leave the calculations to statistical packages that have
built-in commands, which work with or without missing data.

13. Statistical inference is important for program evaluation and policy analysis. Rarely is it enough to
report only the economic (or practical) significance of our estimates. We, and others, must be convinced
that moderate to large estimates of treatment effects are not due purely to sampling variation.
Obtaining a p-value for the null hypothesis that the effect is zero, or, even better, obtaining a 95%
confidence interval, allows us to determine statistical significance in addition to economic significance.
THE CLASSICAL LINEAR MODEL ASSUMPTIONS


Now is a good time to review the full set of classical linear model (CLM) assumptions for cross-sectional 
regression. Following each assumption is a comment about its role in multiple regression analysis. 


Assumption MLR.1 (Linear in Parameters) 
The model in the population can be written as 


$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + u,$$


where $\beta_0, \beta_1, \ldots, \beta_k$ are the unknown parameters (constants) of interest and u is an unobserved random
error or disturbance term.

Assumption MLR.1 describes the population relationship we hope to estimate, and explicitly sets out
the $\beta_j$ (the ceteris paribus population effects of the $x_j$ on y) as the parameters of interest.


Assumption MLR.2 (Random Sampling) 
We have a random sample of n observations, $\{(x_{i1}, x_{i2}, \ldots, x_{ik}, y_i): i = 1, \ldots, n\}$, following the
population model in Assumption MLR.1.

This random sampling assumption means that we have data that can be used to estimate the $\beta_j$, and
that the data have been chosen to be representative of the population described in Assumption MLR.1.


Assumption MLR.3 (No Perfect Collinearity) 
In the sample (and therefore in the population), none of the independent variables is constant, and there are 
no exact linear relationships among the independent variables. 

Once we have a sample of data, we need to know that we can use the data to compute the OLS estimates,
the $\hat{\beta}_j$. This is the role of Assumption MLR.3: if we have sample variation in each independent
variable and no exact linear relationships among the independent variables, we can compute the $\hat{\beta}_j$.


Assumption MLR.4 (Zero Conditional Mean) 
The error u has an expected value of zero given any values of the explanatory variables. In other words, 
$E(u|x_1, x_2, \ldots, x_k) = 0$.

As we discussed in the text, assuming that the unobserved factors are, on average, unrelated to the 
explanatory variables is key to deriving the first statistical property of each OLS estimator: its unbiasedness 
for the corresponding population parameter. Of course, all of the previous assumptions are used to show 
unbiasedness. 


Assumption MLR.5 (Homoskedasticity) 
The error u has the same variance given any values of the explanatory variables. In other words, 


$$\mathrm{Var}(u|x_1, x_2, \ldots, x_k) = \sigma^2.$$

Compared with Assumption MLR.4, the homoskedasticity assumption is of secondary importance; in
particular, Assumption MLR.5 has no bearing on the unbiasedness of the $\hat{\beta}_j$. Still, homoskedasticity has
two important implications: (1) We can derive formulas for the sampling variances whose components are 
easy to characterize; (2) We can conclude, under the Gauss-Markov assumptions MLR.1 through MLR.5, 
that the OLS estimators have smallest variance among all linear, unbiased estimators. 


Assumption MLR.6 (Normality) 
The population error u is independent of the explanatory variables $x_1, x_2, \ldots, x_k$ and is normally distributed
with zero mean and variance $\sigma^2$: $u \sim \mathrm{Normal}(0, \sigma^2)$.

In this chapter, we added Assumption MLR.6 to obtain the exact sampling distributions of t statistics
and F statistics, so that we can carry out exact hypotheses tests. In the next chapter, we will see that
MLR.6 can be dropped if we have a reasonably large sample size. Assumption MLR.6 does imply a stronger
efficiency property of OLS: the OLS estimators have smallest variance among all unbiased estimators;
the comparison group is no longer restricted to estimators linear in the $\{y_i: i = 1, 2, \ldots, n\}$.
Key Terms


Alternative Hypothesis
Classical Linear Model
Classical Linear Model (CLM) Assumptions
Confidence Interval (CI)
Critical Value
Denominator Degrees of Freedom
Economic Significance
Exclusion Restrictions
F Statistic
Joint Hypotheses Test
Jointly Insignificant
Jointly Statistically Significant
Minimum Variance Unbiased Estimators
Multiple Hypotheses Test
Multiple Restrictions
Normality Assumption
Null Hypothesis
Numerator Degrees of Freedom
One-Sided Alternative
One-Tailed Test
Overall Significance of the Regression
p-Value
Practical Significance
R-squared Form of the F Statistic
Rejection Rule
Restricted Model
Significance Level
Statistically Insignificant
Statistically Significant
t Ratio
t Statistic
Two-Sided Alternative
Two-Tailed Test
Unrestricted Model


Problems


1 Which of the following can cause the usual OLS t statistics to be invalid (that is, not to have
t distributions under $H_0$)?


(i) Heteroskedasticity.
(ii) A sample correlation coefficient of .95 between two independent variables that are in the model. 
(iii) Omitting an important explanatory variable. 


2 Consider an equation to explain salaries of CEOs in terms of annual firm sales, return on equity (roe, in 
percentage form), and return on the firm’s stock (ros, in percentage form): 


$$\log(\mathit{salary}) = \beta_0 + \beta_1 \log(\mathit{sales}) + \beta_2\,\mathit{roe} + \beta_3\,\mathit{ros} + u.$$


(i) In terms of the model parameters, state the null hypothesis that, after controlling for sales and 
roe, ros has no effect on CEO salary. State the alternative that better stock market performance 
increases a CEO’s salary. 

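(ii) Using the data in CEOSAL1, the following equation was obtained by OLS:


$$\widehat{\log(\mathit{salary})} = \underset{(.32)}{4.32} + \underset{(.035)}{.280}\,\log(\mathit{sales}) + \underset{(.0041)}{.0174}\,\mathit{roe} + \underset{(.00054)}{.00024}\,\mathit{ros}$$
$$n = 209,\ R^2 = .283.$$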


By what percentage is salary predicted to increase if ros increases by 50 points? Does ros have 
a practically large effect on salary? 
(iii) Test the null hypothesis that ros has no effect on salary against the alternative that ros has a 
positive effect. Carry out the test at the 10% significance level. 
(iv) Would you include ros in a final model explaining CEO compensation in terms of firm perfor- 
mance? Explain. 


3 The variable rdintens is expenditures on research and development (R&D) as a percentage of sales. 
Sales are measured in millions of dollars. The variable profmarg is profits as a percentage of sales. 
Using the data in RDCHEM for 32 firms in the chemical industry, the following equation
is estimated:


$$\widehat{\mathit{rdintens}} = \underset{(1.369)}{.472} + \underset{(.216)}{.321}\,\log(\mathit{sales}) + \underset{(.046)}{.050}\,\mathit{profmarg}$$
$$n = 32,\ R^2 = .099.$$


(i) Interpret the coefficient on log(sales). In particular, if sales increases by 10%, what is the esti- 
mated percentage point change in rdintens? Is this an economically large effect? 
(ii) Test the hypothesis that R&D intensity does not change with sales against the alternative that it
does increase with sales. Do the test at the 5% and 10% levels. 

(iii) Interpret the coefficient on profmarg. Is it economically large? 

(iv) Does profmarg have a statistically significant effect on rdintens ? 


4 Are rent rates influenced by the student population in a college town? Let rent be the average monthly 
rent paid on rental units in a college town in the United States. Let pop denote the total city population, 
avginc the average city income, and pctstu the student population as a percentage of the total popula- 
tion. One model to test for a relationship is 


$$\log(\mathit{rent}) = \beta_0 + \beta_1 \log(\mathit{pop}) + \beta_2 \log(\mathit{avginc}) + \beta_3\,\mathit{pctstu} + u.$$
(i) State the null hypothesis that size of the student body relative to the population has no ceteris
paribus effect on monthly rents. State the alternative that there is an effect.
(ii) What signs do you expect for $\beta_1$ and $\beta_2$?
(iii) The equation estimated using 1990 data from RENTAL for 64 college towns is


$$\widehat{\log(\mathit{rent})} = \underset{(.844)}{.043} + \underset{(.039)}{.066}\,\log(\mathit{pop}) + \underset{(.081)}{.507}\,\log(\mathit{avginc}) + \underset{(.0017)}{.0056}\,\mathit{pctstu}$$
$$n = 64,\ R^2 = .458.$$
What is wrong with the statement: “A 10% increase in population is associated with about a 
6.6% increase in rent”? 
(iv) Test the hypothesis stated in part (i) at the 1% level. 


5 Consider the estimated equation from Example 4.3, which can be used to study the effects of skipping 
class on college GPA: 
$$\widehat{\mathit{colGPA}} = \underset{(.33)}{1.39} + \underset{(.094)}{.412}\,\mathit{hsGPA} + \underset{(.011)}{.015}\,\mathit{ACT} - \underset{(.026)}{.083}\,\mathit{skipped}$$
$$n = 141,\ R^2 = .234.$$
(i) Using the standard normal approximation, find the 95% confidence interval for $\beta_{hsGPA}$.
(ii) Can you reject the hypothesis $H_0$: $\beta_{hsGPA} = .4$ against the two-sided alternative at the 5% level?
(iii) Can you reject the hypothesis $H_0$: $\beta_{hsGPA} = 1$ against the two-sided alternative at the 5% level?


6 In Section 4-5, we used as an example testing the rationality of assessments of housing prices. There, 
we used a log-log model in price and assess [see equation (4.47)]. Here, we use a level-level formulation. 
(i) In the simple regression model


$$\mathit{price} = \beta_0 + \beta_1\,\mathit{assess} + u,$$
the assessment is rational if $\beta_1 = 1$ and $\beta_0 = 0$. The estimated equation is
$$\widehat{\mathit{price}} = \underset{(16.27)}{-14.47} + \underset{(.049)}{.976}\,\mathit{assess}$$
$$n = 88,\ SSR = 165{,}644.51,\ R^2 = .820.$$
First, test the hypothesis that $H_0$: $\beta_0 = 0$ against the two-sided alternative. Then, test $H_0$: $\beta_1 = 1$
against the two-sided alternative. What do you conclude?

(ii) To test the joint hypothesis that $\beta_0 = 0$ and $\beta_1 = 1$, we need the SSR in the restricted model.
This amounts to computing $\sum_{i=1}^{n} (\mathit{price}_i - \mathit{assess}_i)^2$, where n = 88, because the residuals in
the restricted model are just $\mathit{price}_i - \mathit{assess}_i$. (No estimation is needed for the restricted model
because both parameters are specified under $H_0$.) This turns out to yield SSR = 209,448.99.
Carry out the F test for the joint hypothesis.

(iii) Now, test $H_0$: $\beta_2 = 0$, $\beta_3 = 0$, and $\beta_4 = 0$ in the model

$$\mathit{price} = \beta_0 + \beta_1\,\mathit{assess} + \beta_2\,\mathit{lotsize} + \beta_3\,\mathit{sqrft} + \beta_4\,\mathit{bdrms} + u.$$


The R-squared from estimating this model using the same 88 houses is .829.

(iv) If the variance of price changes with assess, lotsize, sqrft, or bdrms, what can you say about the
F test from part (iii)?


7 In Example 4.7, we used data on nonunionized manufacturing firms to estimate the relationship between 
the scrap rate and other firm characteristics. We now look at this example more closely and use all avail- 
able firms. 

(i) The population model estimated in Example 4.7 can be written as 


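$$\log(\mathit{scrap}) = \beta_0 + \beta_1\,\mathit{hrsemp} + \beta_2 \log(\mathit{sales}) + \beta_3 \log(\mathit{employ}) + u.$$
Using the 43 observations available for 1987, the estimated equation is


$$\widehat{\log(\mathit{scrap})} = \underset{(4.57)}{11.74} - \underset{(.019)}{.042}\,\mathit{hrsemp} - \underset{(.370)}{.951}\,\log(\mathit{sales}) + \underset{(.360)}{.992}\,\log(\mathit{employ})$$
$$n = 43,\ R^2 = .310.$$
Compare this equation to that estimated using only the 29 nonunionized firms in the sample.
(ii) Show that the population model can also be written as
$$\log(\mathit{scrap}) = \beta_0 + \beta_1\,\mathit{hrsemp} + \beta_2 \log(\mathit{sales}/\mathit{employ}) + \theta_3 \log(\mathit{employ}) + u,$$


where $\theta_3 = \beta_2 + \beta_3$. [Hint: Recall that $\log(x_2/x_3) = \log(x_2) - \log(x_3)$.] Interpret the
hypothesis $H_0$: $\theta_3 = 0$.
(iii) When the equation from part (ii) is estimated, we obtain
$$\widehat{\log(\mathit{scrap})} = \underset{(4.57)}{11.74} - \underset{(.019)}{.042}\,\mathit{hrsemp} - \underset{(.370)}{.951}\,\log(\mathit{sales}/\mathit{employ}) + \underset{(.205)}{.041}\,\log(\mathit{employ})$$
$$n = 43,\ R^2 = .310.$$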


Controlling for worker training and for the sales-to-employee ratio, do bigger firms have larger 
statistically significant scrap rates? 
(iv) Test the hypothesis that a 1% increase in sales/employ is associated with a 1% drop in the scrap rate. 


8 Consider the multiple regression model with three independent variables, under the classical linear 
model assumptions MLR.1 through MLR.6: 


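$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + u.$$


You would like to test the null hypothesis $H_0$: $\beta_1 - 3\beta_2 = 1$.

(i) Let $\hat{\beta}_1$ and $\hat{\beta}_2$ denote the OLS estimators of $\beta_1$ and $\beta_2$. Find $\mathrm{Var}(\hat{\beta}_1 - 3\hat{\beta}_2)$ in terms of the
variances of $\hat{\beta}_1$ and $\hat{\beta}_2$ and the covariance between them. What is the standard error of $\hat{\beta}_1 - 3\hat{\beta}_2$?

(ii) Write the t statistic for testing $H_0$: $\beta_1 - 3\beta_2 = 1$.

(iii) Define $\theta_1 = \beta_1 - 3\beta_2$ and $\hat{\theta}_1 = \hat{\beta}_1 - 3\hat{\beta}_2$. Write a regression equation involving $\beta_0$, $\theta_1$, $\beta_2$,
and $\beta_3$ that allows you to directly obtain $\hat{\theta}_1$ and its standard error.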


9 In Problem 3 in Chapter 3, we estimated the equation 


$$\widehat{\mathit{sleep}} = \underset{(112.28)}{3{,}638.25} - \underset{(.017)}{.148}\,\mathit{totwrk} - \underset{(5.88)}{11.13}\,\mathit{educ} + \underset{(1.45)}{2.20}\,\mathit{age}$$
$$n = 706,\ R^2 = .113,$$
where we now report standard errors along with the estimates.
(i) Is either educ or age individually significant at the 5% level against a two-sided alternative?
Show your work.
(ii) Dropping educ and age from the equation gives


$$\widehat{\mathit{sleep}} = \underset{(38.91)}{3{,}586.38} - \underset{(.017)}{.151}\,\mathit{totwrk}$$
$$n = 706,\ R^2 = .103.$$


Are educ and age jointly significant in the original equation at the 5% level? Justify your
answer.

(iii) Does including educ and age in the model greatly affect the estimated tradeoff between sleeping
and working?

(iv) Suppose that the sleep equation contains heteroskedasticity. What does this mean about the tests
computed in parts (i) and (ii)?


10 Regression analysis can be used to test whether the market efficiently uses information in valuing
stocks. For concreteness, let return be the total return from holding a firm’s stock over the four-year
period from the end of 1990 to the end of 1994. The efficient markets hypothesis says that these returns
should not be systematically related to information known in 1990. If firm characteristics known at the
beginning of the period help to predict stock returns, then we could use this information in choosing
stocks.

For 1990, let dkr be a firm’s debt to capital ratio, let eps denote the earnings per share, let netinc
denote net income, and let salary denote total compensation for the CEO.
(i) Using the data in RETURN, the following equation was estimated:


$$\widehat{\mathit{return}} = \underset{(6.89)}{-14.37} + \underset{(.201)}{.321}\,\mathit{dkr} + \underset{(.078)}{.043}\,\mathit{eps} - \underset{(.0047)}{.0051}\,\mathit{netinc} + \underset{(.0022)}{.0035}\,\mathit{salary}$$
$$n = 142,\ R^2 = .0395.$$
Test whether the explanatory variables are jointly significant at the 5% level. Is any explanatory
variable individually significant?
(ii) Now, reestimate the model using the log form for netinc and salary:
$$\widehat{\mathit{return}} = \underset{(39.37)}{-36.30} + \underset{(.203)}{.327}\,\mathit{dkr} + \underset{(.080)}{.069}\,\mathit{eps} - \underset{(3.39)}{4.74}\,\log(\mathit{netinc}) + \underset{(6.31)}{7.24}\,\log(\mathit{salary})$$
$$n = 142,\ R^2 = .0330.$$
Do any of your conclusions from part (i) change?
(iii) In this sample, some firms have zero debt and others have negative earnings. Should we try to
use log(dkr) or log(eps) in the model to see if these improve the fit? Explain.
(iv) Overall, is the evidence for predictability of stock returns strong or weak?


11 The following table was created using the data in CEOSAL2, where standard errors are in parentheses
below the coefficients:


Dependent Variable: log(salary)

Independent Variables      (1)        (2)        (3)
log(sales)                .224       .158       .188
                          (.027)     (.040)     (.040)
log(mktval)                 --       .112       .100
                                     (.050)     (.049)
profmarg                    --       -.0023     -.0022
                                     (.0022)    (.0021)
ceoten                      --         --       .0171
                                                (.0055)
comten                      --         --       -.0092
                                                (.0033)
intercept                 4.94       4.62       4.57
                          (0.20)     (0.25)     (0.25)
Observations              177        177        177
R-squared                 .281       .304       .353


The variable mktval is market value of the firm, profmarg is profit as a percentage of sales, ceoten is 

years as CEO with the current company, and comten is total years with the company. 

(i) Comment on the effect of profmarg on CEO salary.

(ii) Does market value have a significant effect? Explain. 

(iii) Interpret the coefficients on ceoten and comten. Are these explanatory variables statistically 
significant? 

(iv) What do you make of the fact that longer tenure with the company, holding the other factors 
fixed, is associated with a lower salary? 


12 The following analysis was obtained using data in MEAP93, which contains school-level pass rates 
(as a percent) on a tenth-grade math test. 
(i) The variable expend is expenditures per student, in dollars, and math10 is the pass rate on the 
exam. The following simple regression relates math10 to lexpend = log(expend): 


$$\widehat{\mathit{math10}} = \underset{(25.53)}{-69.34} + \underset{(3.17)}{11.16}\,\mathit{lexpend}$$
$$n = 408,\ R^2 = .0297.$$


Interpret the coefficient on lexpend. In particular, if expend increases by 10%, what is the estimated
percentage point change in math10? What do you make of the large negative intercept
estimate? (The minimum value of lexpend is 8.11 and its average value is 8.37.)

(ii) Does the small R-squared in part (i) imply that spending is correlated with other factors
affecting math10? Explain. Would you expect the R-squared to be much higher if expenditures
were randomly assigned to schools (that is, independent of other school and student
characteristics) rather than having the school districts determine spending?

(iii) When log of enrollment and the percent of students eligible for the federal free lunch program 
are included, the estimated equation becomes 


$$\widehat{\mathit{math10}} = \underset{(24.99)}{-23.14} + \underset{(3.04)}{7.75}\,\mathit{lexpend} - \underset{(0.58)}{1.26}\,\mathit{lenroll} - \underset{(0.36)}{.324}\,\mathit{lnchprg}$$
$$n = 408,\ R^2 = .1893.$$
Comment on what happens to the coefficient on lexpend. Is the spending coefficient still statistically
different from zero?


(iv) What do you make of the R-squared in part (iii)? What are some other factors that could be used 
to explain math10 (at the school level)? 


13 The data in MEAPSINGLE were used to estimate the following equations relating school-level performance 
on a fourth-grade math test to socioeconomic characteristics of students attending school. The variable free, 
measured at the school level, is the percentage of students eligible for the federal free lunch program. The
variable medinc is median income in the ZIP code, and pctsgle is percent of students not living with two 
parents (also measured at the ZIP code level). See also Computer Exercise C11 in Chapter 3. 
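$$\widehat{\mathit{math4}} = \underset{(1.60)}{96.77} - \underset{(.071)}{.833}\,\mathit{pctsgle}$$
$$n = 299,\ R^2 = .380$$


$$\widehat{\mathit{math4}} = \underset{(1.63)}{93.00} - \underset{(.117)}{.275}\,\mathit{pctsgle} - \underset{(.070)}{.402}\,\mathit{free}$$
$$n = 299,\ R^2 = .459$$


$$\widehat{\mathit{math4}} = \underset{(59.24)}{24.49} - \underset{(.161)}{.274}\,\mathit{pctsgle} - \underset{(.071)}{.422}\,\mathit{free} - \underset{(5.358)}{.752}\,\mathit{lmedinc} + \underset{(4.04)}{9.01}\,\mathit{lexppp}$$
$$n = 299,\ R^2 = .472$$


$$\widehat{\mathit{math4}} = \underset{(32.25)}{17.52} - \underset{(.117)}{.259}\,\mathit{pctsgle} - \underset{(.070)}{.420}\,\mathit{free} + \underset{(3.76)}{8.80}\,\mathit{lexppp}$$
$$n = 299,\ R^2 = .472.$$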
(i) Interpret the coefficient on the variable pctsgle in the first equation. Comment on what happens 
when free is added as an explanatory variable. 
(ii) Does expenditure per pupil, entered in logarithmic form, have a statistically significant effect on 
performance? How big is the estimated effect? 
(iii) If you had to choose among the four equations as your best estimate of the effect of pctsgle and
obtain a 95% confidence interval of $\beta_{pctsgle}$, which would you choose? Why?


Computer Exercises 


C1 The following model can be used to study whether campaign expenditures affect election outcomes: 
$$\mathit{voteA} = \beta_0 + \beta_1 \log(\mathit{expendA}) + \beta_2 \log(\mathit{expendB}) + \beta_3\,\mathit{prtystrA} + u,$$


where voteA is the percentage of the vote received by Candidate A, expendA and expendB are campaign
expenditures by Candidates A and B, and prtystrA is a measure of party strength for Candidate A
(the percentage of the most recent presidential vote that went to A’s party).

(i) What is the interpretation of $\beta_1$?

(ii) In terms of the parameters, state the null hypothesis that a 1% increase in A’s expenditures is 
offset by a 1% increase in B’s expenditures. 

(iii) Estimate the given model using the data in VOTE] and report the results in usual form. Do A’s 
expenditures affect the outcome? What about B’s expenditures? Can you use these results to test 
the hypothesis in part (ii)? 

(iv) Estimate a model that directly gives the f statistic for testing the hypothesis in part (ii). What do 
you conclude? (Use a two-sided alternative.) 


C2 Use the data in LAWSCH85 for this exercise.
(i) Using the same model as in Problem 4 in Chapter 3, state and test the null hypothesis that the
rank of law schools has no ceteris paribus effect on median starting salary.
(ii) Are features of the incoming class of students (namely, LSAT and GPA) individually or jointly
significant for explaining salary? (Be sure to account for missing data on LSAT and GPA.)
(iii) Test whether the size of the entering class (clsize) or the size of the faculty (faculty) needs to be
added to this equation; carry out a single test. (Be careful to account for missing data on clsize
and faculty.)

(iv) What factors might influence the rank of the law school that are not included in the salary regression?


C3 Refer to Computer Exercise C2 in Chapter 3. Now, use the log of the housing price as the dependent
variable:

$$\log(\mathit{price}) = \beta_0 + \beta_1\,\mathit{sqrft} + \beta_2\,\mathit{bdrms} + u.$$

(i) You are interested in estimating and obtaining a confidence interval for the percentage
change in price when a 150-square-foot bedroom is added to a house. In decimal form, this is
$\theta_1 = 150\beta_1 + \beta_2$. Use the data in HPRICE1 to estimate $\theta_1$.

(ii) Write $\beta_2$ in terms of $\theta_1$ and $\beta_1$ and plug this into the log(price) equation.

(iii) Use part (ii) to obtain a standard error for $\hat{\theta}_1$ and use this standard error to construct a 95%
confidence interval.


C4 In Example 4.9, the restricted version of the model can be estimated using all 1,388 observations in
the sample. Compute the R-squared from the regression of bwght on cigs, parity, and faminc using all
observations. Compare this to the R-squared reported for the restricted model in Example 4.9.


C5 Use the data in MLB1 for this exercise.

(i) Use the model estimated in equation (4.31) and drop the variable rbisyr. What happens to the
statistical significance of hrunsyr? What about the size of the coefficient on hrunsyr?

(ii) Add the variables runsyr (runs per year), fldperc (fielding percentage), and sbasesyr
(stolen bases per year) to the model from part (i). Which of these factors are individually
significant?

(iii) In the model from part (ii), test the joint significance of bavg, fldperc, and sbasesyr.


C6 Use the data in WAGE2 for this exercise.
(i) Consider the standard wage equation

$$\log(\mathit{wage}) = \beta_0 + \beta_1\,\mathit{educ} + \beta_2\,\mathit{exper} + \beta_3\,\mathit{tenure} + u.$$

State the null hypothesis that another year of general workforce experience has the same effect
on log(wage) as another year of tenure with the current employer.

(ii) Test the null hypothesis in part (i) against a two-sided alternative, at the 5% significance level,
by constructing a 95% confidence interval. What do you conclude?


C7 Refer to the example used in Section 4-4. You will use the data set TWOYEAR.

(i) The variable phsrank is the person’s high school percentile. (A higher number is better. For
example, 90 means you are ranked better than 90 percent of your graduating class.) Find the
smallest, largest, and average phsrank in the sample.

(ii) Add phsrank to equation (4.26) and report the OLS estimates in the usual form. Is phsrank
statistically significant? How much is 10 percentage points of high school rank worth in terms
of wage?

(iii) Does adding phsrank to (4.26) substantively change the conclusions on the returns to two-
and four-year colleges? Explain.

(iv) The data set contains a variable called id. Explain why if you add id to equation (4.17) or (4.26)
you expect it to be statistically insignificant. What is the two-sided p-value?


C8 The data set 401KSUBS contains information on net financial wealth (nettfa), age of the survey
respondent (age), annual family income (inc), family size (fsize), and participation in certain pension
plans for people in the United States. The wealth and income variables are both recorded in thousands
of dollars. For this question, use only the data for single-person households (so fsize = 1).

(i) How many single-person households are there in the data set?


(ii) Use OLS to estimate the model

$$\mathit{nettfa} = \beta_0 + \beta_1\,\mathit{inc} + \beta_2\,\mathit{age} + u,$$

and report the results using the usual format. Be sure to use only the single-person households
in the sample. Interpret the slope coefficients. Are there any surprises in the slope
estimates?

(iii) Does the intercept from the regression in part (ii) have an interesting meaning? Explain.

(iv) Find the p-value for the test $H_0: \beta_2 = 1$ against $H_1: \beta_2 < 1$. Do you reject $H_0$ at the 1%
significance level?

(v) If you do a simple regression of nettfa on inc, is the estimated coefficient on inc much different
from the estimate in part (ii)? Why or why not?


C9 Use the data in DISCRIM to answer this question. (See also Computer Exercise C8 in Chapter 3.)
(i) Use OLS to estimate the model

$$\log(\mathit{psoda}) = \beta_0 + \beta_1\,\mathit{prpblck} + \beta_2\log(\mathit{income}) + \beta_3\,\mathit{prppov} + u,$$

and report the results in the usual form. Is $\hat{\beta}_1$ statistically different from zero at the 5% level
against a two-sided alternative? What about at the 1% level?

(ii) What is the correlation between log(income) and prppov? Is each variable statistically significant
in any case? Report the two-sided p-values.

(iii) To the regression in part (i), add the variable log(hseval). Interpret its coefficient and report the
two-sided p-value for $H_0: \beta_{\log(\mathit{hseval})} = 0$.

(iv) In the regression in part (iii), what happens to the individual statistical significance of
log(income) and prppov? Are these variables jointly significant? (Compute a p-value.) What do
you make of your answers?

(v) Given the results of the previous regressions, which one would you report as most reliable in
determining whether the racial makeup of a zip code influences local fast-food prices?


C10 Use the data in ELEM94_95 to answer this question. The findings can be compared with those in
Table 4.1. The dependent variable lavgsal is the log of average teacher salary and bs is the ratio of
average benefits to average salary (by school).

(i) Run the simple regression of lavgsal on bs. Is the estimated slope statistically different from
zero? Is it statistically different from -1?

(ii) Add the variables lenrol and lstaff to the regression from part (i). What happens to the coefficient
on bs? How does the situation compare with that in Table 4.1?

(iii) Why is the standard error on the bs coefficient smaller in part (ii) than in part (i)? (Hint: What
happens to the error variance versus multicollinearity when lenrol and lstaff are added?)

(iv) Why is the coefficient on lstaff negative? Is it large in magnitude?

(v) Now add the variable lunch to the regression. Holding other factors fixed, are teachers being
compensated for teaching students from disadvantaged backgrounds? Explain.

(vi) Overall, is the pattern of results that you find with ELEM94_95 consistent with the pattern
in Table 4.1?


C11 Use the data in HTV to answer this question. See also Computer Exercise C10 in Chapter 3.
(i) Estimate the regression model

$$\mathit{educ} = \beta_0 + \beta_1\,\mathit{motheduc} + \beta_2\,\mathit{fatheduc} + \beta_3\,\mathit{abil} + \beta_4\,\mathit{abil}^2 + u$$

by OLS and report the results in the usual form. Test the null hypothesis that educ is linearly
related to abil against the alternative that the relationship is quadratic.

(ii) Using the equation in part (i), test $H_0: \beta_1 = \beta_2$ against a two-sided alternative. What is the
p-value of the test?

(iii) Add the two college tuition variables to the regression from part (i) and determine whether they
are jointly statistically significant.

(iv) What is the correlation between tuit17 and tuit18? Explain why using the average of the tuition
over the two years might be preferred to adding each separately. What happens when you do use
the average?

(v) Do the findings for the average tuition variable in part (iv) make sense when interpreted
causally? What might be going on?

C12 Use the data in ECONMATH to answer the following questions. 


(i) Estimate a model relating colgpa to hsgpa, actmth, and acteng. Report the results in the usual
form. Are all explanatory variables statistically significant?

(ii) Consider an increase in hsgpa of one standard deviation, about .343. By how much does colgpa
increase, holding actmth and acteng fixed? About how many standard deviations would actmth
have to increase to change colgpa by the same amount as one standard deviation in hsgpa?
Comment.

(iii) Test the null hypothesis that actmth and acteng have the same effect (in the population) against
a two-sided alternative. Report the p-value and describe your conclusions.

(iv) Suppose the college admissions officer wants you to use the data on the variables in part (i) to
create an equation that explains at least 50 percent of the variation in colgpa. What would you
tell the officer?


C13 Use the data set GPA1 to answer this question. It was used in Computer Exercise C13 in Chapter 3 to 
estimate the effect of PC ownership on college GPA. 


(i) Run the regression colGPA on PC, hsGPA, and ACT and obtain a 95% confidence interval
for $\beta_{PC}$. Is the estimated coefficient statistically significant at the 5% level against a two-sided
alternative?

(ii) Discuss the statistical significance of the estimates $\hat{\beta}_{hsGPA}$ and $\hat{\beta}_{ACT}$ in part (i). Is hsGPA or ACT
the more important predictor of colGPA?

(iii) Add the two indicators fathcoll and mothcoll to the regression in part (i). Is either
individually significant? Are they jointly statistically significant at the 5% level?


C14 Use the data set JTRAIN98 to answer this question. 


(i) Add the unemployment indicator unem96 to the regression reported in equation (4.52). Interpret
its coefficient and discuss whether its sign and magnitude seem sensible. Is the estimate statistically
significant?

(ii) What happens to the estimated job training effect compared with equation (4.52)? Is it still
economically and statistically significant?

(iii) Find the correlation between earn96 and unem96. Is it about what you would expect? Explain.

(iv) Do you think your finding in part (iii) means you should drop unem96 from the regression?
Explain.


CHAPTER 5 Multiple Regression Analysis: OLS Asymptotics


In Chapters 3 and 4, we covered what are called finite sample, small sample, or exact properties of
the OLS estimators in the population model

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + u.$$ [5.1]


For example, the unbiasedness of OLS (derived in Chapter 3) under the first four Gauss-Markov 
assumptions is a finite sample property because it holds for any sample size n (subject to the mild 
restriction that n must be at least as large as the total number of parameters in the regression model, 
k + 1). Similarly, the fact that OLS is the best linear unbiased estimator under the full set of Gauss- 
Markov assumptions (MLR.1 through MLR.5) is a finite sample property. 

In Chapter 4, we added the classical linear model Assumption MLR.6, which states that the error 
term u is normally distributed and independent of the explanatory variables. This allowed us to derive 
the exact sampling distributions of the OLS estimators (conditional on the explanatory variables in 
the sample). In particular, Theorem 4.1 showed that the OLS estimators have normal sampling distributions,
which led directly to the t and F distributions for t and F statistics. If the error is not normally
distributed, the distribution of a t statistic is not exactly t, and an F statistic does not have an exact
F distribution for any sample size.

In addition to finite sample properties, it is important to know the asymptotic properties or large 
sample properties of estimators and test statistics. These properties are not defined for a particular 
sample size; rather, they are defined as the sample size grows without bound. Fortunately, under the
assumptions we have made, OLS has satisfactory large sample properties. One practically important
finding is that even without the normality assumption (Assumption MLR.6), t and F statistics have
approximately t and F distributions, at least in large sample sizes. We discuss this in more detail in 
Section 5-2, after we cover the consistency of OLS in Section 5-1. 

Because the material in this chapter is more difficult to understand, and because one can conduct 
empirical work without a deep understanding of its contents, this chapter may be skipped. However, 
we will necessarily refer to large sample properties of OLS when we study discrete response variables 
in Chapter 7, relax the homoskedasticity assumption in Chapter 8, and delve into estimation with time 
series data in Part 2. Furthermore, virtually all advanced econometric methods derive their justification
using large-sample analysis, so readers who will continue into Part 3 should be familiar with the
contents of this chapter.


5-1 Consistency 


Unbiasedness of estimators, although important, cannot always be obtained. For example, as we discussed
in Chapter 3, the standard error of the regression, $\hat{\sigma}$, is not an unbiased estimator for $\sigma$, the
standard deviation of the error u, in a multiple regression model. Although the OLS estimators are
unbiased under MLR.1 through MLR.4, in Chapter 11 we will find that there are time series regressions
where the OLS estimators are not unbiased. Further, in Part 3 of the text, we encounter several
other estimators that are biased yet useful.

Although not all useful estimators are unbiased, virtually all economists agree that consistency 
is a minimal requirement for an estimator. The Nobel Prize-winning econometrician Clive W. J.
Granger once remarked, “If you can’t get it right as n goes to infinity, you shouldn’t be in this business.”
The implication is that, if your estimator of a particular population parameter is not consistent,
then you are wasting your time. 

There are a few different ways to describe consistency. Formal definitions and results are given in
Math Refresher C; here, we focus on an intuitive understanding. For concreteness, let $\hat{\beta}_j$ be the OLS
estimator of $\beta_j$ for some j. For each n, $\hat{\beta}_j$ has a probability distribution (representing its possible
values in different random samples of size n). Because $\hat{\beta}_j$ is unbiased under Assumptions MLR.1 through
MLR.4, this distribution has mean value $\beta_j$. If this estimator is consistent, then the distribution of $\hat{\beta}_j$
becomes more and more tightly distributed around $\beta_j$ as the sample size grows. As n tends to infinity,
the distribution of $\hat{\beta}_j$ collapses to the single point $\beta_j$. In effect, this means that we can make our
estimator arbitrarily close to $\beta_j$ if we can collect as much data as we want. This convergence is illustrated
in Figure 5.1.

Naturally, for any application, we have a fixed sample size, which is a major reason an asymp- 
totic property such as consistency can be difficult to grasp. Consistency involves a thought experi- 
ment about what would happen as the sample size gets large (while, at the same time, we obtain 
numerous random samples for each sample size). If obtaining more and more data does not generally 
get us closer to the parameter value of interest, then we are using a poor estimation procedure. 
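
The thought experiment can be mimicked on a computer. The following is a minimal simulation sketch, not part of the text's own material: the function name `simulate_slope_estimates`, the parameter values, and the choice of a skewed error distribution are all illustrative assumptions. For each sample size it draws many random samples from a simple regression model that satisfies MLR.1 through MLR.4 and records the OLS slope estimate from each one.

```python
import numpy as np

def simulate_slope_estimates(n, reps=2000, beta0=1.0, beta1=0.5, seed=0):
    """Return `reps` OLS slope estimates, each from a random sample of size n.

    All names and parameter values here are hypothetical choices for illustration.
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(reps, n))
    # Skewed, mean-zero errors independent of x: normality is not needed here.
    u = rng.exponential(scale=1.0, size=(reps, n)) - 1.0
    y = beta0 + beta1 * x + u
    x_dev = x - x.mean(axis=1, keepdims=True)
    # Simple regression slope: sum((x - xbar) * y) / sum((x - xbar)^2), per sample.
    return (x_dev * y).sum(axis=1) / (x_dev ** 2).sum(axis=1)

if __name__ == "__main__":
    for n in (10, 100, 1000):
        b1 = simulate_slope_estimates(n)
        print(f"n = {n:5d}   mean of estimates = {b1.mean():.3f}   sd = {b1.std():.3f}")
```

Under these assumptions, the average of the estimates should stay near the chosen slope for every n, while their spread should shrink as n grows, which is the pattern that Figure 5.1 depicts.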

Conveniently, the same set of assumptions implies both unbiasedness and consistency of OLS. 
We summarize with a theorem. 


THEOREM 5.1 CONSISTENCY OF OLS


Under Assumptions MLR.1 through MLR.4, the OLS estimator $\hat{\beta}_j$ is consistent for $\beta_j$, for all
$j = 0, 1, \ldots, k$.




FIGURE 5.1 Sampling distributions of $\hat{\beta}_1$ for sample sizes $n_1 < n_2 < n_3$.


A general proof of this result is most easily developed using the matrix algebra methods described
in Appendices D and E. But we can prove Theorem 5.1 without difficulty in the case of the simple
regression model. We focus on the slope estimator, $\hat{\beta}_1$.

The proof starts out the same as the proof of unbiasedness: we write down the formula for $\hat{\beta}_1$ and
then plug in $y_i = \beta_0 + \beta_1 x_{i1} + u_i$:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_{i1} - \bar{x}_1)\,y_i}{\sum_{i=1}^{n}(x_{i1} - \bar{x}_1)^2} = \beta_1 + \frac{n^{-1}\sum_{i=1}^{n}(x_{i1} - \bar{x}_1)\,u_i}{n^{-1}\sum_{i=1}^{n}(x_{i1} - \bar{x}_1)^2},$$ [5.2]

where dividing both the numerator and denominator by n does not change the expression
but allows us to directly apply the law of large numbers. When we apply the law of large
numbers to the averages in the second part of equation (5.2), we conclude that the numerator
and denominator converge in probability to the population quantities, $\mathrm{Cov}(x_1,u)$ and $\mathrm{Var}(x_1)$,
respectively. Provided that $\mathrm{Var}(x_1) \neq 0$, which is assumed in MLR.3, we can use the properties of
probability limits (see Math Refresher C) to get

$$\operatorname{plim}\hat{\beta}_1 = \beta_1 + \mathrm{Cov}(x_1,u)/\mathrm{Var}(x_1) = \beta_1 \quad \text{because } \mathrm{Cov}(x_1,u) = 0.$$ [5.3]

We have used the fact, discussed in Chapters 2 and 3, that $E(u|x_1) = 0$ (Assumption MLR.4) implies
that $x_1$ and u are uncorrelated (have zero covariance).
As a technical matter, to ensure that the probability limits exist, we should assume that
$\mathrm{Var}(x_1) < \infty$ and $\mathrm{Var}(u) < \infty$ (which means that their probability distributions are not too spread
out), but we will not worry about cases where these assumptions might fail. Further, we could (and,
in an advanced treatment of econometrics, we would) explicitly relax Assumption MLR.3 to rule
out only perfect collinearity in the population. As stated, Assumption MLR.3 also disallows perfect
collinearity among the regressors in the sample we have at hand. Technically, for the thought
experiment we can show consistency with no perfect collinearity in the population, allowing for the 
unlucky possibility that we draw a data set that does exhibit perfect collinearity. From a practical 
perspective the distinction is unimportant, as we cannot compute the OLS estimates for our sample 
if MLR.3 fails. 

The previous arguments, and equation (5.3) in particular, show that OLS is consistent in the sim- 
ple regression case if we assume only zero correlation. This is also true in the general case. We now 
state this as an assumption. 


Assumption MLR.4’ Zero Mean and Zero Correlation 


$E(u) = 0$ and $\mathrm{Cov}(x_j, u) = 0$, for $j = 1, 2, \ldots, k$.


Assumption MLR.4’ is weaker than Assumption MLR.4 in the sense that the latter implies the
former. One way to characterize the zero conditional mean assumption, $E(u|x_1, \ldots, x_k) = 0$, is that
any function of the explanatory variables is uncorrelated with u. Assumption MLR.4’ requires only
that each $x_j$ is uncorrelated with u (and that u has a zero mean in the population). In Chapter 2, we
actually motivated the OLS estimator for simple regression using Assumption MLR.4’, and the first
order conditions for OLS in the multiple regression case, given in equation (3.13), are simply the sample
analogs of the population zero correlation assumptions (and zero mean assumption). Therefore, in
some ways, Assumption MLR.4’ is a more natural assumption because it leads directly to the OLS
estimates. Further, when we think about violations of Assumption MLR.4, we usually think in terms
of $\mathrm{Cov}(x_j, u) \neq 0$ for some j. So why have we used Assumption MLR.4 until now? There are
two reasons, both of which we have touched on earlier. First, OLS turns out to be biased (but consistent)
under Assumption MLR.4’ if $E(u|x_1, \ldots, x_k)$ depends on any of the $x_j$. Because we have previously
focused on finite sample, or exact, sampling properties of the OLS estimators, we have needed
the stronger zero conditional mean assumption.

Second, and probably more important, is that the zero conditional mean assumption means that
we have properly modeled the population regression function (PRF). That is, under Assumption
MLR.4 we can write

$$E(y|x_1, \ldots, x_k) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k,$$

and so we can obtain partial effects of the explanatory variables on the average or expected value
of y. If we instead only assume Assumption MLR.4’, $\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$ need not represent the
PRF, and we face the possibility that some nonlinear functions of the $x_j$, such as $x_j^2$, could be correlated
with the error u. A situation like this means that we have neglected nonlinearities in the model
that could help us better explain y; if we knew that, we would usually include such nonlinear func- 
tions. In other words, most of the time we hope to get a good estimate of the PRF, and so the zero 
conditional mean assumption is natural. Nevertheless, the weaker zero correlation assumption turns 
out to be useful in interpreting OLS estimation of a linear model as providing the best linear approx- 
imation to the PRF. It is also used in more advanced settings, such as in Chapter 15, where we have 
no interest in modeling a PRF. For further discussion of this somewhat subtle point, see Wooldridge 
(2010, Chapter 4) as well as Problem 6 at the end of this chapter. 
5-1a Deriving the Inconsistency in OLS


Just as failure of $E(u|x_1, \ldots, x_k) = 0$ causes bias in the OLS estimators, correlation between u and
any of $x_1, x_2, \ldots, x_k$ generally causes all of the OLS estimators to be inconsistent. This simple but
important observation is often summarized as: if the error is correlated with any of the independent 
variables, then OLS is biased and inconsistent. This is very unfortunate because it means that any bias 
persists as the sample size grows. 

In the simple regression case, we can obtain the inconsistency from the first part of equation (5.3),
which holds whether or not u and $x_1$ are uncorrelated. The inconsistency in $\hat{\beta}_1$ (sometimes loosely
called the asymptotic bias) is

$$\operatorname{plim}\hat{\beta}_1 - \beta_1 = \mathrm{Cov}(x_1,u)/\mathrm{Var}(x_1).$$ [5.4]

Because $\mathrm{Var}(x_1) > 0$, the inconsistency in $\hat{\beta}_1$ is positive if $x_1$ and u are positively correlated, and the
inconsistency is negative if $x_1$ and u are negatively correlated. If the covariance between $x_1$ and u is
small relative to the variance in $x_1$, the inconsistency can be negligible; unfortunately, we cannot even
estimate how big the covariance is because u is unobserved.
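
Equation (5.4) can be checked numerically in a setting where, unlike in practice, we control the error ourselves. The sketch below is only an illustration under stated assumptions: the function name `slope_with_endogeneity` and the values of `beta1` and `rho` are hypothetical. The error is constructed so that Cov(x1, u) = rho and Var(x1) = 1, so (5.4) predicts that the OLS slope settles near beta1 + rho rather than beta1.

```python
import numpy as np

def slope_with_endogeneity(n, beta1=2.0, rho=0.5, seed=0):
    """OLS slope from one sample of size n when Cov(x1, u) = rho and Var(x1) = 1.

    Names and parameter values are illustrative assumptions, not from the text.
    """
    rng = np.random.default_rng(seed)
    x1 = rng.normal(size=n)
    # u = rho * x1 + e, with e independent of x1, gives Cov(x1, u) = rho * Var(x1) = rho.
    u = rho * x1 + rng.normal(size=n)
    y = 1.0 + beta1 * x1 + u
    x_dev = x1 - x1.mean()
    return (x_dev * y).sum() / (x_dev ** 2).sum()

if __name__ == "__main__":
    for n in (100, 10_000, 1_000_000):
        print(f"n = {n:9d}   slope estimate = {slope_with_endogeneity(n):.4f}")
```

Adding observations tightens the estimate around beta1 + rho, not around beta1, which is the sense in which the problem does not vanish with more data.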

We can use (5.4) to derive the asymptotic analog of the omitted variable bias (see Table 3.2 in
Chapter 3). Suppose the true model,

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + v,$$

satisfies the first four Gauss-Markov assumptions. Then v has a zero mean and is uncorrelated with
$x_1$ and $x_2$. If $\hat{\beta}_0$, $\hat{\beta}_1$, and $\hat{\beta}_2$ denote the OLS estimators from the regression of y on $x_1$ and $x_2$, then
Theorem 5.1 implies that these estimators are consistent. If we omit $x_2$ from the regression and do the
simple regression of y on $x_1$, then $u = \beta_2 x_2 + v$. Let $\tilde{\beta}_1$ denote the simple regression slope estimator.
Then

$$\operatorname{plim}\tilde{\beta}_1 = \beta_1 + \beta_2\delta_1,$$ [5.5]

where

$$\delta_1 = \mathrm{Cov}(x_1,x_2)/\mathrm{Var}(x_1).$$ [5.6]


Thus, for practical purposes, we can view the inconsistency as being the same as the bias. The difference
is that the inconsistency is expressed in terms of the population variance of $x_1$ and the population
covariance between $x_1$ and $x_2$, while the bias is based on their sample counterparts (because we condition
on the values of $x_1$ and $x_2$ in the sample).

If $x_1$ and $x_2$ are uncorrelated (in the population), then $\delta_1 = 0$, and $\tilde{\beta}_1$ is a consistent estimator of
$\beta_1$ (although not necessarily unbiased). If $x_2$ has a positive partial effect on y, so that $\beta_2 > 0$, and $x_1$
and $x_2$ are positively correlated, so that $\delta_1 > 0$, then the inconsistency in $\tilde{\beta}_1$ is positive, and so on.
We can obtain the direction of the inconsistency or asymptotic bias from Table 3.2. If the covariance
between $x_1$ and $x_2$ is small relative to the variance of $x_1$, the inconsistency can be small.
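
The omitted variable result in (5.5) and (5.6) can also be illustrated by simulation. This is a hedged sketch with made-up values: the function name `omitted_variable_demo` and the parameters are assumptions chosen so that Cov(x1, x2) = 0.5 and Var(x1) = 1, which makes the population coefficient in (5.6) equal to 0.5. With beta1 = 1 and beta2 = 2, (5.5) then predicts a short-regression slope near 1 + 2(0.5) = 2, while the regression that includes x2 should recover a slope near 1.

```python
import numpy as np

def omitted_variable_demo(n, beta1=1.0, beta2=2.0, seed=0):
    """Return (short_slope, long_slope) on x1 from one simulated sample of size n.

    Names and parameter values are hypothetical and chosen only for illustration.
    """
    rng = np.random.default_rng(seed)
    x1 = rng.normal(size=n)
    # x2 = 0.5 * x1 + noise, so Cov(x1, x2) = 0.5 and delta_1 = 0.5 / Var(x1) = 0.5.
    x2 = 0.5 * x1 + rng.normal(size=n)
    v = rng.normal(size=n)
    y = beta1 * x1 + beta2 * x2 + v
    # Short regression: y on a constant and x1 only (x2 omitted).
    X_short = np.column_stack([np.ones(n), x1])
    b_short = np.linalg.lstsq(X_short, y, rcond=None)[0]
    # Long regression: y on a constant, x1, and x2.
    X_long = np.column_stack([np.ones(n), x1, x2])
    b_long = np.linalg.lstsq(X_long, y, rcond=None)[0]
    return b_short[1], b_long[1]

if __name__ == "__main__":
    short_slope, long_slope = omitted_variable_demo(100_000)
    print(f"short regression slope on x1: {short_slope:.3f}")
    print(f"long regression slope on x1:  {long_slope:.3f}")
```

Setting the coefficient 0.5 in the construction of `x2` to zero makes x1 and x2 uncorrelated, in which case both slopes should be close to beta1.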


Housing Prices and Distance from an Incinerator 


Let y denote the price of a house (price), let $x_1$ denote the distance from the house to a new trash
incinerator (distance), and let $x_2$ denote the “quality” of the house (quality). The variable quality is
left vague so that it can include things like size of the house and lot, number of bedrooms and bathrooms,
and intangibles such as attractiveness of the neighborhood. If the incinerator depresses house
prices, then $\beta_1$ should be positive: everything else being equal, a house that is farther away from the
incinerator is worth more. By definition, $\beta_2$ is positive because higher quality houses sell for more,
other factors being equal. If the incinerator was built farther away, on average, from better homes, then
distance and quality are positively correlated, and so $\delta_1 > 0$. A simple regression of price on distance
[or log(price) on log(distance)] will tend to overestimate the effect of the incinerator: $\beta_1 + \beta_2\delta_1 > \beta_1$.
An important point about inconsistency in OLS estimators is that, by definition, the problem does
not go away by adding more observations to the sample. If anything, the problem gets worse with
more data: the OLS estimator gets closer and closer to $\beta_1 + \beta_2\delta_1$ as the sample size grows.

Deriving the sign and magnitude of the inconsistency in the general k regressor case is harder,
just as deriving the bias is more difficult. We need to remember that if we have the model in equation
(5.1) where, say, $x_1$ is correlated with u but the other independent variables are uncorrelated with u,
all of the OLS estimators are generally inconsistent. For example, in the k = 2 case,

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u,$$

suppose that $x_2$ and u are uncorrelated but $x_1$ and u are correlated. Then the OLS estimators $\hat{\beta}_1$ and $\hat{\beta}_2$
will generally both be inconsistent. (The intercept will also be inconsistent.) The inconsistency in $\hat{\beta}_2$
arises when $x_1$ and $x_2$ are correlated, as is usually the case. If $x_1$ and $x_2$ are uncorrelated, then any
correlation between $x_1$ and u does not result in the inconsistency of $\hat{\beta}_2$: $\operatorname{plim}\hat{\beta}_2 = \beta_2$. Further, the
inconsistency in $\hat{\beta}_1$ is the same as in (5.4). The same statement holds in the general case: if $x_1$ is correlated
with u, but $x_1$ and u are uncorrelated with the other independent variables, then only $\hat{\beta}_1$ is inconsistent,
and the inconsistency is given by (5.4). The general case is very similar to the omitted variable case in
Section 3A.4 of Appendix 3A.


GOING FURTHER 5.1
Suppose that the model

$$\mathit{score} = \beta_0 + \beta_1\,\mathit{skipped} + \beta_2\,\mathit{priGPA} + u$$

satisfies the first four Gauss-Markov assumptions, where score is score on a final exam, skipped is
number of classes skipped, and priGPA is GPA prior to the current semester. If $\tilde{\beta}_1$ is from the simple
regression of score on skipped, what is the direction of the asymptotic bias in $\tilde{\beta}_1$?


5-2 Asymptotic Normality and Large Sample Inference 


Consistency of an estimator is an important property, but it alone does not allow us to perform 
statistical inference. Simply knowing that the estimator is getting closer to the population value 
as the sample size grows does not allow us to test hypotheses about the parameters. For test- 
ing, we need the sampling distribution of the OLS estimators. Under the classical linear model 
assumptions MLR.1 through MLR.6, Theorem 4.1 shows that the sampling distributions are normal.
This result is the basis for deriving the t and F distributions that we use so often in applied
econometrics.

The exact normality of the OLS estimators hinges crucially on the normality of the distribution
of the error, u, in the population. If the errors $u_1, u_2, \ldots, u_n$ are random draws from some distribution
other than the normal, the $\hat{\beta}_j$ will not be normally distributed, which means that the t statistics will
not have t distributions and the F statistics will not have F distributions. This is a potentially serious
problem because our inference hinges on being able to obtain critical values or p-values from the t or
F distributions.

Recall that Assumption MLR.6 is equivalent to saying that the distribution of y given
$x_1, x_2, \ldots, x_k$ is normal. Because y is observed and u is not, in a particular application, it is much
easier to think about whether the distribution of y is likely to be normal. In fact, we have already 
seen a few examples where y definitely cannot have a conditional normal distribution. A normally 
distributed random variable is symmetrically distributed about its mean, it can take on any posi- 
tive or negative value, and more than 95% of the area under the distribution is within two standard 
deviations. 




FIGURE 5.2 Histogram of prate using the data in 401K. 
In Example 3.5, we estimated a model explaining the number of arrests of young men during
a particular year (narr86). In the population, most men are not arrested during the year, and
the vast majority are arrested one time at the most. (In the sample of 2,725 men in the data set
CRIME1, fewer than 8% were arrested more than once during 1986.) Because narr86 takes on
only two values for 92% of the sample, it cannot be close to being normally distributed in the
population.

In Example 4.6, we estimated a model explaining participation percentages (prate) in 401(k)
pension plans. The frequency distribution (also called a histogram) in Figure 5.2 shows that the distribution
of prate is heavily skewed, with most of its mass near the upper bound of 100 and a long tail to the left,
rather than being normally distributed. In fact, over 40% of the observations on prate are at the value 100,
indicating 100% participation. This violates the normality assumption even conditional on the explanatory variables.

We know that normality plays no role in the unbiasedness of OLS, nor does it affect the conclusion
that OLS is the best linear unbiased estimator under the Gauss-Markov assumptions. But exact
inference based on t and F statistics requires MLR.6. Does this mean that, in our prior analysis of
prate in Example 4.6, we must abandon the t statistics for determining which variables are statistically
significant? Fortunately, the answer to this question is no. Even though the $y_i$ are not from a
normal distribution, we can use the central limit theorem from Math Refresher C to conclude that the
OLS estimators satisfy asymptotic normality, which means they are approximately normally distributed
in large enough sample sizes.
THEOREM 5.2 ASYMPTOTIC NORMALITY OF OLS


Under the Gauss-Markov Assumptions MLR.1 through MLR.5,

(i) $\sqrt{n}(\hat{\beta}_j - \beta_j) \overset{a}{\sim} \mathrm{Normal}(0, \sigma^2/a_j^2)$, where $\sigma^2/a_j^2 > 0$ is the asymptotic variance of $\sqrt{n}(\hat{\beta}_j - \beta_j)$;
for the slope coefficients, $a_j^2 = \operatorname{plim}\left(n^{-1}\sum_{i=1}^{n}\hat{r}_{ij}^2\right)$, where the $\hat{r}_{ij}$ are the residuals from regressing $x_j$ on the
other independent variables. We say that $\hat{\beta}_j$ is asymptotically normally distributed (see Math Refresher C);

(ii) $\hat{\sigma}^2$ is a consistent estimator of $\sigma^2 = \mathrm{Var}(u)$;

(iii) For each j,

$$(\hat{\beta}_j - \beta_j)/\mathrm{sd}(\hat{\beta}_j) \overset{a}{\sim} \mathrm{Normal}(0,1)$$

and

$$(\hat{\beta}_j - \beta_j)/\mathrm{se}(\hat{\beta}_j) \overset{a}{\sim} \mathrm{Normal}(0,1),$$

where $\mathrm{se}(\hat{\beta}_j)$ is the usual OLS standard error.


The proof of asymptotic normality is somewhat complicated and is sketched in the appendix for 
the simple regression case. Part (ii) follows from the law of large numbers, and part (iii) follows from 
parts (i) and (ii) and the asymptotic properties discussed in Math Refresher C. 

Theorem 5.2 is useful because the normality Assumption MLR.6 has been dropped; the only 
restriction on the distribution of the error is that it has finite variance, something we will always 
assume. We have also assumed zero conditional mean (MLR.4) and homoskedasticity of u (MLR.5). 

In trying to understand the meaning of Theorem 5.2, it is important to keep separate the notions 
of the population distribution of the error term, u, and the sampling distributions of the $\hat{\beta}_j$ as the
sample size grows. A common mistake is to think that something is happening to the distribution of 
u—namely, that it is getting “closer” to normal—as the sample size grows. But remember that the 
population distribution is immutable and has nothing to do with the sample size. For example, we 
previously discussed narr86, the number of times a young man is arrested during the year 1986. The 
nature of this variable—it takes on small, nonnegative integer values—is fixed in the population. 
Whether we sample 10 men or 1,000 men from this population obviously has no effect on the popula- 
tion distribution. 

What Theorem 5.2 says is that, regardless of the population distribution of u, the OLS estima- 
tors, when properly standardized, have approximate standard normal distributions. This approxima- 
tion comes about by the central limit theorem because the OLS estimators involve—in a complicated 
way—the use of sample averages. Effectively, the sequence of distributions of averages of the under- 
lying errors is approaching normality for virtually any population distribution. 
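
A small simulation can make this concrete. The following Python sketch is only an illustration under assumed values: the population parameters (intercept 1, slope 0.5), the sample size of 200, the number of replications, and the exponential error distribution are hypothetical choices, not taken from any data set in the text. The errors are drawn independently of x, so the zero conditional mean and homoskedasticity assumptions hold, but u is strongly skewed. For each replication the sketch computes the usual t statistic for the slope and then checks how often it falls between -1.96 and 1.96.

```python
import math
import random
import statistics

def ols_t_stat(x, y, beta1_true):
    """t statistic (b1_hat - beta1_true) / se(b1_hat) from a simple regression of y on x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sst_x = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * yi for xi, yi in zip(x, y)) / sst_x
    b0 = ybar - b1 * xbar
    ssr = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    sigma_hat = math.sqrt(ssr / (n - 2))      # df = n - k - 1 with k = 1
    se_b1 = sigma_hat / math.sqrt(sst_x)
    return (b1 - beta1_true) / se_b1

def simulate_t_stats(n, reps, seed=0):
    """Repeatedly draw a sample with skewed errors and collect the slope t statistics."""
    rng = random.Random(seed)
    beta0, beta1 = 1.0, 0.5                   # hypothetical population parameters
    stats = []
    for _ in range(reps):
        x = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # Skewed, mean-zero errors: exponential(1) minus its mean of 1.
        u = [rng.expovariate(1.0) - 1.0 for _ in range(n)]
        y = [beta0 + beta1 * xi + ui for xi, ui in zip(x, u)]
        stats.append(ols_t_stat(x, y, beta1))
    return stats

t_stats = simulate_t_stats(n=200, reps=2000)
coverage = sum(abs(t) <= 1.96 for t in t_stats) / len(t_stats)
print(f"mean of t statistics:      {statistics.mean(t_stats):.3f}")
print(f"std. dev. of t statistics: {statistics.stdev(t_stats):.3f}")
print(f"share with |t| <= 1.96:    {coverage:.3f}")
```

If the normal approximation is working, the share should be near .95 even though u is far from normal. Rerunning with a much smaller n is one way to explore how the quality of the approximation depends on the sample size.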

Notice how the standardized $\hat{\beta}_j$ has an asymptotic standard normal distribution whether we divide
the difference $\hat{\beta}_j - \beta_j$ by $\mathrm{sd}(\hat{\beta}_j)$ (which we do not observe because it depends on $\sigma$) or by $\mathrm{se}(\hat{\beta}_j)$
(which we can compute from our data because it depends on $\hat{\sigma}$). In other words, from an asymptotic
point of view it does not matter that we have to replace $\sigma$ with $\hat{\sigma}$. Of course, replacing $\sigma$ with $\hat{\sigma}$ affects
the exact distribution of the standardized $\hat{\beta}_j$. We just saw in Chapter 4 that under the classical linear
model assumptions, $(\hat{\beta}_j - \beta_j)/\mathrm{sd}(\hat{\beta}_j)$ has an exact Normal(0,1) distribution and $(\hat{\beta}_j - \beta_j)/\mathrm{se}(\hat{\beta}_j)$ has
an exact $t_{n-k-1}$ distribution.

How should we use the result in equation (5.7)? It may seem one consequence is that, if we are
going to appeal to large-sample analysis, we should now use the standard normal distribution for
inference rather than the t distribution. But from a practical perspective it is just as legitimate to write

$(\hat{\beta}_j - \beta_j)/\mathrm{se}(\hat{\beta}_j) \overset{a}{\sim} t_{n-k-1} = t_{df}$, [5.8]

because $t_{df}$ approaches the Normal(0,1) distribution as df gets large. Because we know under the
CLM assumptions the $t_{n-k-1}$ distribution holds exactly, it makes sense to treat $(\hat{\beta}_j - \beta_j)/\mathrm{se}(\hat{\beta}_j)$ as a
$t_{n-k-1}$ random variable generally, even when MLR.6 does not hold.

Equation (5.8) tells us that t testing and the construction of confidence intervals are carried out
exactly as under the classical linear model assumptions. This means that our analysis of dependent
variables like prate and narr86 does not have to change at all if the Gauss-Markov assumptions hold:
in both cases, we have at least 1,500 observations, which is certainly enough to justify the
approximation of the central limit theorem.
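
As a minimal sketch of what a large-sample confidence interval looks like in practice, the Python snippet below builds an interval of the form estimate plus or minus a critical value times the standard error, taking the critical value from the standard normal distribution. The function name `asymptotic_ci` and the estimate and standard error passed to it are hypothetical and chosen only for illustration. With many degrees of freedom the t critical value is very close to the normal one, so the two approaches give nearly the same interval.

```python
from statistics import NormalDist

def asymptotic_ci(beta_hat, se, level=0.95):
    """Large-sample confidence interval: beta_hat +/- z * se, with z from Normal(0,1)."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)   # about 1.96 when level = 0.95
    return beta_hat - z * se, beta_hat + z * se

# Hypothetical estimate and standard error, for illustration only.
lower, upper = asymptotic_ci(beta_hat=0.50, se=0.10)
print(f"{lower:.3f}, {upper:.3f}")   # prints 0.304, 0.696
```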

If the sample size is not very large, then the t distribution can be a poor approximation to the
distribution of the t statistics when u is not normally distributed. Unfortunately, there are no general
prescriptions on how big the sample size must be before the approximation is good enough. Some
econometricians think that n = 30 is satisfactory, but this cannot be sufficient for all possible
distributions of u. Depending on the distribution of u, more observations may be necessary before the central
limit theorem delivers a useful approximation. Further, the quality of the approximation depends not
just on n, but on the df, n - k - 1: with more independent variables in the model, a larger sample
size is usually needed to use the t approximation. Methods for inference with small degrees of
freedom and nonnormal errors are outside the scope of this text. We will simply use the t statistics as we
always have without worrying about the normality assumption.

It is very important to see that Theorem 5.2 does require the homoskedasticity assumption (along 
with the zero conditional mean assumption). If Var(y|x) is not constant, the usual t statistics and confidence
intervals are invalid no matter how large the sample size is; the central limit theorem does not
bail us out when it comes to heteroskedasticity. For this reason, we devote all of Chapter 8 to discuss- 
ing what can be done in the presence of heteroskedasticity. 

One conclusion of Theorem 5.2 is that $\hat{\sigma}^2$ is a consistent estimator of $\sigma^2$; we already know
from Theorem 3.3 that $\hat{\sigma}^2$ is unbiased for $\sigma^2$ under the Gauss-Markov assumptions. The consistency
implies that $\hat{\sigma}$ is a consistent estimator of $\sigma$, which is important in establishing the asymptotic
normality result in equation (5.7).

Remember that $\hat{\sigma}^2$ appears in the standard error for each $\hat{\beta}_j$. In fact, the estimated variance of $\hat{\beta}_j$ is

$\widehat{\mathrm{Var}}(\hat{\beta}_j) = \dfrac{\hat{\sigma}^2}{SST_j(1 - R_j^2)}$, [5.9]

where $SST_j$ is the total sum of squares of $x_j$ in the sample, and $R_j^2$ is the R-squared from regressing $x_j$
on all of the other independent variables. In Section 3-4, we studied each component of (5.9), which we
will now expound on in the context of asymptotic analysis. As the sample size grows, $\hat{\sigma}^2$ converges in
probability to the constant $\sigma^2$. Further, $R_j^2$ approaches a number strictly between zero and unity (so that
$1 - R_j^2$ converges to some number between zero and one). The sample variance of $x_j$ is $SST_j/n$, and so
$SST_j/n$ converges to $\mathrm{Var}(x_j)$ as the sample size grows. This means that $SST_j$ grows at approximately the
same rate as the sample size: $SST_j \approx n\sigma_j^2$, where $\sigma_j^2$ is the population variance of $x_j$. When we combine
these facts, we find that $\widehat{\mathrm{Var}}(\hat{\beta}_j)$ shrinks to zero at the rate of 1/n; this is why larger sample sizes
are better.

GOING FURTHER 5.2

In a regression model with a large sample size, what is an approximate 95% confidence interval for $\hat{\beta}_j$
under MLR.1 through MLR.5? We call this an asymptotic confidence interval.

When u is not normally distributed, the square root of (5.9) is sometimes called the asymptotic
standard error, and t statistics are called asymptotic t statistics. Because these are the same
quantities we dealt with in Chapter 4, we will just call them standard errors and t statistics, with the
understanding that sometimes they have only large-sample justification. A similar comment holds for an
asymptotic confidence interval constructed from the asymptotic standard error.

Using the preceding argument about the estimated variance, we can write

$\mathrm{se}(\hat{\beta}_j) \approx c_j/\sqrt{n}$, [5.10]

where $c_j$ is a positive constant that does not depend on the sample size. In fact, the constant $c_j$ can be
shown to be

$c_j = \dfrac{\sigma}{\sigma_j\sqrt{1 - \rho_j^2}}$,

where $\sigma = \mathrm{sd}(u)$, $\sigma_j = \mathrm{sd}(x_j)$, and $\rho_j^2$ is the population R-squared from regressing $x_j$ on the other
explanatory variables. Just like studying equation (5.9) to see which variables affect $\mathrm{Var}(\hat{\beta}_j)$ under
the Gauss-Markov assumptions, we can use this expression for $c_j$ to study the impact of larger
error standard deviation ($\sigma$), more population variation in $x_j$ ($\sigma_j$), and multicollinearity in the
population ($\rho_j^2$).
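
The step from (5.9) to (5.10) can be written out in one line. The sketch below only substitutes the large-sample approximations already stated in the text for $\hat{\sigma}^2$, $SST_j$, and $R_j^2$; the $\approx$ signs are approximations that improve as n grows, not identities.

```latex
\widehat{\mathrm{Var}}(\hat{\beta}_j)
  = \frac{\hat{\sigma}^2}{SST_j\,(1 - R_j^2)}
  \approx \frac{\sigma^2}{n\,\sigma_j^2\,(1 - \rho_j^2)}
\quad\Longrightarrow\quad
\mathrm{se}(\hat{\beta}_j)
  \approx \frac{1}{\sqrt{n}}\cdot\frac{\sigma}{\sigma_j\sqrt{1 - \rho_j^2}}
  = \frac{c_j}{\sqrt{n}} .
```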

Equation (5.10) is only an approximation, but it is a useful rule of thumb: standard errors can be 
expected to shrink at a rate that is the inverse of the square root of the sample size. 


Standard Errors in a Birth Weight Equation 


We use the data in BWGHT to estimate a relationship where log of birth weight is the dependent
variable, and cigarettes smoked per day (cigs) and log of family income are independent variables.
The total number of observations is 1,388. Using the first half of the observations (694), the
standard error for $\hat{\beta}_{cigs}$ is about .0013. The standard error using all of the observations is about .00086.
The ratio of the latter standard error to the former is $.00086/.0013 \approx .662$. This is pretty close to
$\sqrt{694/1{,}388} \approx .707$, the ratio obtained from the approximation in (5.10). In other words, equation
(5.10) implies that the standard error using the larger sample size should be about 70.7% of the
standard error using the smaller sample. This percentage is pretty close to the 66.2% we actually compute
from the ratio of the standard errors.
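
The arithmetic in this example is easy to check. The short Python sketch below uses only the two rounded standard errors and the two sample sizes reported above; the variable names are ours.

```python
import math

se_half = 0.0013     # standard error on cigs using the first 694 observations
se_full = 0.00086    # standard error on cigs using all 1,388 observations

observed_ratio = se_full / se_half
predicted_ratio = math.sqrt(694 / 1388)   # implied by se approx. c_j / sqrt(n)

print(f"observed ratio:  {observed_ratio:.3f}")    # prints 0.662
print(f"predicted ratio: {predicted_ratio:.3f}")   # prints 0.707
```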


The asymptotic normality of the OLS estimators also implies that the F statistics have approxi- 
mate F distributions in large sample sizes. Thus, for testing exclusion restrictions or other multiple 
hypotheses, nothing changes from what we have done before. 


5-2a Other Large Sample Tests: The Lagrange Multiplier Statistic 


Once we enter the realm of asymptotic analysis, other test statistics can be used for hypothesis testing. 
For most purposes, there is little reason to go beyond the usual t and F statistics: as we just saw, these
statistics have large sample justification without the normality assumption. Nevertheless, sometimes 
it is useful to have other ways to test multiple exclusion restrictions, and we now cover the Lagrange 
multiplier (LM) statistic, which has achieved some popularity in modern econometrics. 

The name “Lagrange multiplier statistic” comes from constrained optimization, a topic beyond 
the scope of this text. [See Davidson and MacKinnon (1993).] The name score statistic—which also 
comes from optimization using calculus—is used as well. Fortunately, in the linear regression frame- 
work, it is simple to motivate the LM statistic without delving into complicated mathematics. 

The form of the LM statistic we derive here relies on the Gauss-Markov assumptions, the same 
assumptions that justify the F statistic in large samples. We do not need the normality assumption. 

To derive the LM statistic, consider the usual multiple regression model with k independent
variables:

$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + u$. [5.11]

We would like to test whether, say, the last q of these variables all have zero population parameters:
the null hypothesis is

$H_0\colon \beta_{k-q+1} = 0, \ldots, \beta_k = 0$, [5.12]

which puts q exclusion restrictions on the model (5.11). As with F testing, the alternative to (5.12) is
that at least one of the parameters is different from zero.

The LM statistic requires estimation of the restricted model only. Thus, assume that we have run
the regression

$y = \tilde{\beta}_0 + \tilde{\beta}_1 x_1 + \cdots + \tilde{\beta}_{k-q} x_{k-q} + \tilde{u}$, [5.13]

where "~" indicates that the estimates are from the restricted model. In particular, $\tilde{u}$ indicates the
residuals from the restricted model. (As always, this is just shorthand to indicate that we obtain the
restricted residual for each observation in the sample.)

If the omitted variables $x_{k-q+1}$ through $x_k$ truly have zero population coefficients, then, at least
approximately, $\tilde{u}$ should be uncorrelated with each of these variables in the sample. This suggests
running a regression of these residuals on those independent variables excluded under $H_0$, which is
almost what the LM test does. However, it turns out that, to get a usable test statistic, we must include
all of the independent variables in the regression. (We must include all regressors because, in general,
the omitted regressors in the restricted model are correlated with the regressors that appear in the
restricted model.) Thus, we run the regression of

$\tilde{u}$ on $x_1, x_2, \ldots, x_k$. [5.14]

This is an example of an auxiliary regression, a regression that is used to compute a test statistic but
whose coefficients are not of direct interest.

How can we use the regression output from (5.14) to test (5.12)? If (5.12) is true, the R-squared
from (5.14) should be "close" to zero, subject to sampling error, because $\tilde{u}$ will be approximately
uncorrelated with all the independent variables. The question, as always with hypothesis testing, is
how to determine when the statistic is large enough to reject the null hypothesis at a chosen
significance level. It turns out that, under the null hypothesis, the sample size multiplied by the usual
R-squared from the auxiliary regression (5.14) is distributed asymptotically as a chi-square random
variable with q degrees of freedom. This leads to a simple procedure for testing the joint significance
of a set of q independent variables.


The Lagrange Multiplier Statistic for q Exclusion Restrictions:

(i) Regress y on the restricted set of independent variables and save the residuals, $\tilde{u}$.

(ii) Regress $\tilde{u}$ on all of the independent variables and obtain the R-squared, say, $R_u^2$ (to distinguish it
from the R-squareds obtained with y as the dependent variable).

(iii) Compute $LM = nR_u^2$ [the sample size times the R-squared obtained from step (ii)].

(iv) Compare LM to the appropriate critical value, c, in a $\chi_q^2$ distribution; if LM > c, the null
hypothesis is rejected. Even better, obtain the p-value as the probability that a $\chi_q^2$ random variable exceeds
the value of the test statistic. If the p-value is less than the desired significance level, then $H_0$ is
rejected. If not, we fail to reject $H_0$. The rejection rule is essentially the same as for F testing.
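
Steps (i) through (iii) can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the routine any particular regression package uses: the helper names (`ols_residuals`, `r_squared`, `lm_statistic`) are hypothetical, OLS is computed by solving the normal equations directly, and the data are simulated so that the null hypothesis is true (x2 and x3 have zero coefficients).

```python
import random

def ols_residuals(columns, y):
    """Residuals from regressing y on an intercept plus the given regressor columns."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(X[0])
    # Normal equations (X'X) b = X'y, stored as an augmented matrix.
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(p)]
         + [sum(X[i][r] * y[i] for i in range(n))] for r in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(p):
        pivot = max(range(c, p), key=lambda r: abs(A[r][c]))
        A[c], A[pivot] = A[pivot], A[c]
        for r in range(p):
            if r != c:
                factor = A[r][c] / A[c][c]
                A[r] = [a - factor * b for a, b in zip(A[r], A[c])]
    beta = [A[r][p] / A[r][r] for r in range(p)]
    return [y[i] - sum(b * xv for b, xv in zip(beta, X[i])) for i in range(n)]

def r_squared(columns, y):
    """Usual R-squared from regressing y on an intercept and the columns."""
    resid = ols_residuals(columns, y)
    ybar = sum(y) / len(y)
    sst = sum((v - ybar) ** 2 for v in y)
    ssr = sum(e ** 2 for e in resid)
    return 1.0 - ssr / sst

def lm_statistic(y, restricted, excluded):
    """LM = n * R_u^2 for H0: the coefficients on the excluded columns are all zero."""
    u_tilde = ols_residuals(restricted, y)              # step (i)
    r2_u = r_squared(restricted + excluded, u_tilde)    # step (ii)
    return len(y) * r2_u                                # step (iii)

# Simulated data in which H0 is true: x2 and x3 do not enter the equation for y.
rng = random.Random(1)
n = 500
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [0.5 * a + rng.gauss(0, 1) for a in x1]
x3 = [rng.gauss(0, 1) for _ in range(n)]
y = [1.0 + 2.0 * a + rng.uniform(-1, 1) for a in x1]

lm = lm_statistic(y, restricted=[x1], excluded=[x2, x3])
print(f"LM = {lm:.3f} with q = 2 restrictions")
# Step (iv): compare LM with a chi-square critical value with 2 degrees of freedom.
```

Note that the errors here are uniform rather than normal; the procedure does not rely on MLR.6.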


Because of its form, the LM statistic is sometimes referred to as the n-R-squared statistic.
Unlike with the F statistic, the degrees of freedom in the unrestricted model plays no role in carrying
out the LM test. All that matters is the number of restrictions being tested (q), the size of the auxiliary
R-squared ($R_u^2$), and the sample size (n). The df in the unrestricted model plays no role because of the
asymptotic nature of the LM statistic. But we must be sure to multiply $R_u^2$ by the sample size to obtain
LM; a seemingly low value of the R-squared can still lead to joint significance if n is large.

Before giving an example, a word of caution is in order. If in step (i), we mistakenly regress y 
on all of the independent variables and obtain the residuals from this unrestricted regression to be 
used in step (ii), we do not get an interesting statistic: the resulting R-squared will be exactly zero! 
This is because OLS chooses the estimates so that the residuals are uncorrelated in samples with all 
included independent variables [see equations in (3.13)]. Thus, we can only test (5.12) by regressing 
the restricted residuals on all of the independent variables. (Regressing the restricted residuals on the 
restricted set of independent variables will also produce $R_u^2 = 0$.)
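
The caution above is easy to verify numerically. The Python sketch below is a hypothetical illustration with one regressor, so the restricted model under the null of a zero slope contains only an intercept; the simulated data and the helper names are assumptions made for the example. It computes n times the auxiliary R-squared twice: once (incorrectly) with residuals from the unrestricted regression, and once with the restricted residuals.

```python
import random

def simple_ols(x, y):
    """Intercept, slope, and residuals from regressing y on x (with an intercept)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b1 = (sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
          / sum((a - xbar) ** 2 for a in x))
    b0 = ybar - b1 * xbar
    resid = [b - b0 - b1 * a for a, b in zip(x, y)]
    return b0, b1, resid

def r_squared(x, y):
    """R-squared from the simple regression of y on x."""
    _, _, resid = simple_ols(x, y)
    ybar = sum(y) / len(y)
    return 1.0 - sum(e ** 2 for e in resid) / sum((b - ybar) ** 2 for b in y)

rng = random.Random(2)
n = 300
x = [rng.gauss(0, 1) for _ in range(n)]
y = [1.0 + 0.8 * a + rng.gauss(0, 1) for a in x]   # hypothetical population model

# Mistake: residuals taken from the UNRESTRICTED regression (y on x).
_, _, u_hat = simple_ols(x, y)
r2_wrong = r_squared(x, u_hat)

# Correct: residuals from the RESTRICTED regression (y on an intercept only).
ybar = sum(y) / n
u_tilde = [b - ybar for b in y]
r2_right = r_squared(x, u_tilde)

print(f"n * R^2 using unrestricted residuals: {n * r2_wrong:.6f}")
print(f"n * R^2 using restricted residuals:   {n * r2_right:.3f}")
```

The first number is zero up to rounding error, whatever the data, while the second is an LM statistic that can be compared with a chi-square critical value with one degree of freedom.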


Economic Model of Crime

We illustrate the LM test by using a slight extension of the crime model from Example 3.5:

$\mathit{narr86} = \beta_0 + \beta_1\,\mathit{pcnv} + \beta_2\,\mathit{avgsen} + \beta_3\,\mathit{tottime} + \beta_4\,\mathit{ptime86} + \beta_5\,\mathit{qemp86} + u$,

where

narr86 = the number of times a man was arrested.
pcnv = the proportion of prior arrests leading to conviction.
avgsen = average sentence served from past convictions.
tottime = total time the man has spent in prison prior to 1986 since reaching the age of 18.
ptime86 = months spent in prison in 1986.
qemp86 = number of quarters in 1986 during which the man was legally employed.

We use the LM statistic to test the null hypothesis that avgsen and tottime have no effect on
narr86 once the other factors have been controlled for.

In step (i), we estimate the restricted model by regressing narr86 on pcnv, ptime86, and qemp86;
the variables avgsen and tottime are excluded from this regression. We obtain the residuals $\tilde{u}$ from
this regression, 2,725 of them. Next, we run the regression of

$\tilde{u}$ on pcnv, ptime86, qemp86, avgsen, and tottime; [5.15]

as always, the order in which we list the independent variables is irrelevant. This second regression
produces $R_u^2$, which turns out to be about .0015. This may seem small, but we must multiply it by n
to get the LM statistic: $LM = 2{,}725(.0015) \approx 4.09$. The 10% critical value in a chi-square
distribution with two degrees of freedom is about 4.61 (rounded to two decimal places; see Table G.4). Thus,
we fail to reject the null hypothesis that $\beta_{avgsen} = 0$ and $\beta_{tottime} = 0$ at the 10% level. The p-value is
$P(\chi_2^2 > 4.09) \approx .129$, so we would reject $H_0$ at the 15% level.

As a comparison, the F test for joint significance of avgsen and tottime yields a p-value of about
.131, which is pretty close to that obtained using the LM statistic. This is not surprising because,
asymptotically, the two statistics have the same probability of Type I error. (That is, they reject the
null hypothesis with the same frequency when the null is true.)
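
The LM arithmetic in this example can be reproduced in a few lines of Python. The sketch uses only the numbers reported above. It relies on one fact that holds specifically for two degrees of freedom: the probability that a chi-square variable with 2 df exceeds c equals exp(-c/2). Because the auxiliary R-squared is reported only to four decimal places, the third decimal of the p-value can differ slightly from the value in the text.

```python
import math

n = 2725          # number of restricted residuals in the example
r2_u = 0.0015     # R-squared from the auxiliary regression (rounded, as reported)
critical_10 = 4.61  # 10% critical value for chi-square with 2 df, as quoted in the text

lm = n * r2_u
# For a chi-square variable with exactly 2 degrees of freedom, P(X > c) = exp(-c / 2).
p_value = math.exp(-lm / 2.0)

print(f"LM = {lm:.4f}")
print(f"p-value = {p_value:.4f}")
print("reject H0 at the 10% level" if lm > critical_10 else "fail to reject H0 at the 10% level")
```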


As the previous example suggests, with a large sample, we rarely see important discrepancies 
between the outcomes of LM and F tests. We will use the F statistic for the most part because it is 
computed routinely by most regression packages. But you should be aware of the LM statistic as it is 
used in applied work. 

One final comment on the LM statistic. As with the F statistic, we must be sure to use the same 
observations in steps (i) and (ii). If data are missing for some of the independent variables that are 
excluded under the null hypothesis, the residuals from step (i) should be obtained from a regression 
on the reduced data set.


5-3 Asymptotic Efficiency of OLS


We know that, under the Gauss-Markov assumptions, the OLS estimators are best linear unbiased. 
OLS is also asymptotically efficient among a certain class of estimators under the Gauss-Markov 
assumptions. A general treatment requires matrix algebra and advanced asymptotic analysis. First, we 
describe the result in the simple regression case. 

In the model

$y = \beta_0 + \beta_1 x + u$, [5.16]

u has a zero conditional mean under MLR.4: E(u|x) = 0. This opens up a variety of consistent
estimators for $\beta_0$ and $\beta_1$; as usual, we focus on the slope parameter, $\beta_1$. Let g(x) be any function of x; for
example, $g(x) = x^2$ or $g(x) = 1/(1 + |x|)$. Then u is uncorrelated with g(x) (see Property CE.5 in
Math Refresher B). Let $z_i = g(x_i)$ for all observations i. Then the estimator

$\tilde{\beta}_1 = \left(\sum_{i=1}^{n}(z_i - \bar{z})y_i\right) \Big/ \left(\sum_{i=1}^{n}(z_i - \bar{z})x_i\right)$ [5.17]

is consistent for $\beta_1$, provided g(x) and x are correlated. [Remember, it is possible that g(x) and x
are uncorrelated because correlation measures linear dependence.] To see this, we can plug in
$y_i = \beta_0 + \beta_1 x_i + u_i$ and write $\tilde{\beta}_1$ as

$\tilde{\beta}_1 = \beta_1 + \left(n^{-1}\sum_{i=1}^{n}(z_i - \bar{z})u_i\right) \Big/ \left(n^{-1}\sum_{i=1}^{n}(z_i - \bar{z})x_i\right)$. [5.18]

Now, we can apply the law of large numbers to the numerator and denominator, which converge in
probability to Cov(z,u) and Cov(z,x), respectively. Provided that $\mathrm{Cov}(z,x) \neq 0$, so that z and x are
correlated, we have

$\mathrm{plim}\,\tilde{\beta}_1 = \beta_1 + \mathrm{Cov}(z,u)/\mathrm{Cov}(z,x) = \beta_1$,

because Cov(z,u) = 0 under MLR.4.

It is more difficult to show that $\tilde{\beta}_1$ is asymptotically normal. Nevertheless, using arguments
similar to those in the appendix, it can be shown that $\sqrt{n}(\tilde{\beta}_1 - \beta_1)$ is asymptotically normal with mean
zero and asymptotic variance $\sigma^2\mathrm{Var}(z)/[\mathrm{Cov}(z,x)]^2$. The asymptotic variance of the OLS estimator
is obtained when z = x, in which case, Cov(z,x) = Cov(x,x) = Var(x). Therefore, the asymptotic
variance of $\sqrt{n}(\hat{\beta}_1 - \beta_1)$, where $\hat{\beta}_1$ is the OLS estimator, is $\sigma^2\mathrm{Var}(x)/[\mathrm{Var}(x)]^2 = \sigma^2/\mathrm{Var}(x)$. Now,
the Cauchy-Schwarz inequality (see Math Refresher B.4) implies that $[\mathrm{Cov}(z,x)]^2 \le \mathrm{Var}(z)\mathrm{Var}(x)$,
which implies that the asymptotic variance of $\sqrt{n}(\hat{\beta}_1 - \beta_1)$ is no larger than that of $\sqrt{n}(\tilde{\beta}_1 - \beta_1)$.
We have shown in the simple regression case that, under the Gauss-Markov assumptions, the OLS
estimator has an asymptotic variance no larger than that of any estimator of the form (5.17). [The estimator
in (5.17) is an example of an instrumental variables estimator, which we will study extensively in
Chapter 15.] If the homoskedasticity assumption fails, then there are estimators of the form (5.17) that
have a smaller asymptotic variance than OLS. We will see this in Chapter 8.
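
To get a feel for the size of the efficiency loss, the two asymptotic variance formulas can be evaluated with sample moments standing in for the population moments. The Python sketch below is a hypothetical illustration: x is simulated as standard normal, the error variance is set to 1, and $g(x) = x^3$ is an assumed choice. (With a distribution symmetric about zero, $g(x) = x^2$ would be uncorrelated with x and so could not be used.)

```python
import random

def sample_cov(a, b):
    """Sample covariance with divisor n; sample_cov(a, a) is the sample variance."""
    n = len(a)
    abar, bbar = sum(a) / n, sum(b) / n
    return sum((ai - abar) * (bi - bbar) for ai, bi in zip(a, b)) / n

def avar_form_517(z, x, sigma2=1.0):
    """sigma^2 * Var(z) / Cov(z, x)^2, using sample moments in place of population ones."""
    return sigma2 * sample_cov(z, z) / sample_cov(z, x) ** 2

rng = random.Random(3)
x = [rng.gauss(0.0, 1.0) for _ in range(5000)]
z = [xi ** 3 for xi in x]          # hypothetical choice g(x) = x^3, correlated with x

avar_ols = avar_form_517(x, x)     # z = x reduces the formula to sigma^2 / Var(x)
avar_alt = avar_form_517(z, x)

print(f"OLS (z = x):           {avar_ols:.3f}")
print(f"alternative (z = x^3): {avar_alt:.3f}")
```

By the Cauchy-Schwarz inequality, which also holds for sample moments, the second number can never be smaller than the first.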

The general case is similar but much more difficult mathematically. In the k regressor case, the
class of consistent estimators is obtained by generalizing the OLS first order conditions:

$\sum_{i=1}^{n} g_j(x_i)(y_i - \tilde{\beta}_0 - \tilde{\beta}_1 x_{i1} - \cdots - \tilde{\beta}_k x_{ik}) = 0, \quad j = 0, 1, \ldots, k$, [5.19]

where $g_j(x_i)$ denotes any function of all explanatory variables for observation i. As can be seen by
comparing (5.19) with the OLS first order conditions in (3.13), we obtain the OLS estimators when
$g_0(x_i) = 1$ and $g_j(x_i) = x_{ij}$ for $j = 1, 2, \ldots, k$. There are infinitely many estimators that can be
defined using the equations in (5.19) because we can use any functions of the $x_{ij}$ that we want.

THEOREM 5.3

ASYMPTOTIC EFFICIENCY OF OLS

Under the Gauss-Markov assumptions, let $\tilde{\beta}_j$ denote estimators that solve equations of the form (5.19)
and let $\hat{\beta}_j$ denote the OLS estimators. Then for $j = 0, 1, 2, \ldots, k$, the OLS estimators have the
smallest asymptotic variances: $\mathrm{Avar}\,\sqrt{n}(\hat{\beta}_j - \beta_j) \le \mathrm{Avar}\,\sqrt{n}(\tilde{\beta}_j - \beta_j)$.

Proving consistency of the estimators in (5.19), let alone showing they are asymptotically normal,
is mathematically difficult. See Wooldridge (2010, Chapter 5).


Summary 


The claims underlying the material in this chapter are fairly technical, but their practical implications are 
straightforward. We have shown that the first four Gauss-Markov assumptions imply that OLS is consist- 
ent. Furthermore, all of the methods of testing and constructing confidence intervals that we learned in 
Chapter 4 are approximately valid without assuming that the errors are drawn from a normal distribution 
(equivalently, the distribution of y given the explanatory variables is not normal). This means that we can 
apply OLS and use previous methods for an array of applications where the dependent variable is not even 
approximately normally distributed. We also showed that the LM statistic can be used instead of the F sta- 
tistic for testing exclusion restrictions. 

Before leaving this chapter, we should note that examples such as Example 5.3 may very well have 
problems that do require special attention. For a variable such as narr86, which is zero or one for most men 
in the population, a linear model may not be able to adequately capture the functional relationship between 
narr86 and the explanatory variables. Moreover, even if a linear model does describe the expected value of
arrests, heteroskedasticity might be a problem. Problems such as these are not mitigated as the sample size 
grows, and we will return to them in later chapters. 


Key Terms 


Asymptotic Bias
Asymptotic Confidence Interval
Asymptotic Normality
Asymptotic Properties
Asymptotic Standard Error
Asymptotic t Statistics
Asymptotic Variance
Asymptotically Efficient
Auxiliary Regression
Consistency
Inconsistency
Lagrange Multiplier (LM) Statistic
Large Sample Properties
n-R-Squared Statistic
Score Statistic


Problems 


1 In the simple regression model under MLR.1 through MLR.4, we argued that the slope estimator, $\hat{\beta}_1$,
is consistent for $\beta_1$. Using $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}$, show that $\mathrm{plim}\,\hat{\beta}_0 = \beta_0$. [You need to use the consistency of
$\hat{\beta}_1$ and the law of large numbers, along with the fact that $\beta_0 = E(y) - \beta_1 E(x)$.]


2 Suppose that the model

$\mathit{pctstck} = \beta_0 + \beta_1\,\mathit{funds} + \beta_2\,\mathit{risktol} + u$

satisfies the first four Gauss-Markov assumptions, where pctstck is the percentage of a worker's
pension invested in the stock market, funds is the number of mutual funds that the worker can
choose from, and risktol is some measure of risk tolerance (larger risktol means the person has a
higher tolerance for risk). If funds and risktol are positively correlated, what is the inconsistency
in $\tilde{\beta}_1$, the slope coefficient in the simple regression of pctstck on funds?


3 The data set SMOKE contains information on smoking behavior and other variables for a random
sample of single adults from the United States. The variable cigs is the (average) number of cigarettes
smoked per day. Do you think cigs has a normal distribution in the U.S. adult population? Explain.


4 In the simple regression model (5.16), under the first four Gauss-Markov assumptions, we showed that
estimators of the form (5.17) are consistent for the slope, $\beta_1$. Given such an estimator, define an
estimator of $\beta_0$ by $\tilde{\beta}_0 = \bar{y} - \tilde{\beta}_1\bar{x}$. Show that $\mathrm{plim}\,\tilde{\beta}_0 = \beta_0$.


5 The following histogram was created using the variable score in the data file ECONMATH. Thirty bins 
were used to create the histogram, and the height of each cell is the proportion of observations falling 
within the corresponding interval. The best-fitting normal distribution (that is, the one using the sample
mean and sample standard deviation) has been superimposed on the histogram.

[Figure: histogram of score with the fitted normal density; horizontal axis: course score (in percentage form)]


(i) If you use the normal distribution to estimate the probability that score exceeds 100, would the 
answer be zero? Why does your answer contradict the assumption of a normal distribution for 
score? 


(ii) Explain what is happening in the left tail of the histogram. Does the normal distribution fit well 
in the left tail? 


6 Consider the equation

$y = \beta_0 + \beta_1 x + \beta_2 x^2 + u$, $E(u|x) = 0$,

where the explanatory variable x has a standard normal distribution in the population. In particular,
$E(x) = 0$, $E(x^2) = \mathrm{Var}(x) = 1$, and $E(x^3) = 0$. This last condition holds because the standard normal
distribution is symmetric about zero. We want to study what we can say about the OLS estimator of $\beta_1$ if we
omit $x^2$ and compute the simple regression estimator of the intercept and slope.

(i) Show that we can write

$y = \alpha_0 + \beta_1 x + v$,

where E(v) = 0. In particular, find v and the new intercept, $\alpha_0$.

(ii) Show that E(v|x) depends on x unless $\beta_2 = 0$.

(iii) Show that Cov(x, v) = 0.

(iv) If $\hat{\beta}_1$ is the slope coefficient from regressing $y_i$ on $x_i$, is $\hat{\beta}_1$ consistent for $\beta_1$? Is it unbiased?
Explain.

(v) Argue that being able to estimate $\beta_1$ has some value in the following sense: $\beta_1$ is the partial
effect of x on y evaluated at x = 0, the average value of x.

(vi) Explain why being able to consistently estimate $\beta_1$ and $\beta_2$ is more valuable than just estimating $\beta_1$.


Computer Exercises 


C1 Use the data in WAGE1 for this exercise.
(i) Estimate the equation

$\mathit{wage} = \beta_0 + \beta_1\,\mathit{educ} + \beta_2\,\mathit{exper} + \beta_3\,\mathit{tenure} + u$.

Save the residuals and plot a histogram.

(ii) Repeat part (i), but with log(wage) as the dependent variable.

(iii) Would you say that Assumption MLR.6 is closer to being satisfied for the level-level model or
the log-level model?


C2 Use the data in GPA2 for this exercise.
(i) Using all 4,137 observations, estimate the equation

$\mathit{colgpa} = \beta_0 + \beta_1\,\mathit{hsperc} + \beta_2\,\mathit{sat} + u$

and report the results in standard form.

(ii) Reestimate the equation in part (i), using the first 2,070 observations.

(iii) Find the ratio of the standard errors on hsperc from parts (i) and (ii). Compare this with the
result from (5.10).


C3 In equation (4.42) of Chapter 4, using the data set BWGHT, compute the LM statistic for testing
whether motheduc and fatheduc are jointly significant. In obtaining the residuals for the restricted
model, be sure that the restricted model is estimated using only those observations for which all
variables in the unrestricted model are available (see Example 4.9).


C4 Several statistics are commonly used to detect nonnormality in underlying population distributions.
Here we will study one that measures the amount of skewness in a distribution. Recall that any
normally distributed random variable is symmetric about its mean; therefore, if we standardize a
symmetrically distributed random variable, say $z = (y - \mu_y)/\sigma_y$, where $\mu_y = E(y)$ and $\sigma_y = \mathrm{sd}(y)$, then
z has mean zero, variance one, and $E(z^3) = 0$. Given a sample of data $\{y_i : i = 1, \ldots, n\}$, we can
standardize $y_i$ in the sample by using $z_i = (y_i - \hat{\mu}_y)/\hat{\sigma}_y$, where $\hat{\mu}_y$ is the sample mean and $\hat{\sigma}_y$ is the sample
standard deviation. (We ignore the fact that these are estimates based on the sample.) A sample statistic
that measures skewness is $n^{-1}\sum_{i=1}^{n} z_i^3$, or where n is replaced with (n - 1) as a degrees-of-freedom
adjustment. If y has a normal distribution in the population, the skewness measure in the sample for the
standardized values should not differ significantly from zero.

(i) First use the data set 401KSUBS, keeping only observations with fsize = 1. Find the skewness
measure for inc. Do the same for log(inc). Which variable has more skewness and therefore
seems less likely to be normally distributed?

(ii) Next use BWGHT2. Find the skewness measures for bwght and log(bwght). What do you
conclude?

(iii) Evaluate the following statement: "The logarithmic transformation always makes a positive
variable look more normally distributed."

(iv) If we are interested in the normality assumption in the context of regression, should we be
evaluating the unconditional distributions of y and log(y)? Explain.


C5 Consider the analysis in Computer Exercise C11 in Chapter 4 using the data in HTV, where educ is the
dependent variable in a regression.
(i) How many different values are taken on by educ in the sample? Does educ have a continuous
distribution?

(ii) Plot a histogram of educ with a normal distribution overlay. Does the distribution of educ
appear anything close to normal?

(iii) Which of the CLM assumptions seems clearly violated in the model

$\mathit{educ} = \beta_0 + \beta_1\,\mathit{motheduc} + \beta_2\,\mathit{fatheduc} + \beta_3\,\mathit{abil} + \beta_4\,\mathit{abil}^2 + u$?

How does this violation change the statistical inference procedures carried out in Computer
Exercise C11 in Chapter 4?


C6 Use the data in ECONMATH to answer this question.
(i) Logically, what are the smallest and largest values that can be taken on by the variable score?
What are the smallest and largest values in the sample?

(ii) Consider the linear model

$\mathit{score} = \beta_0 + \beta_1\,\mathit{colgpa} + \beta_2\,\mathit{actmth} + \beta_3\,\mathit{acteng} + u$.

Why cannot Assumption MLR.6 hold for the error term u? What consequences does this have
for using the usual t statistic to test $H_0\colon \beta_3 = 0$?

(iii) Estimate the model from part (ii) and obtain the t statistic and associated p-value for testing
$H_0\colon \beta_3 = 0$. How would you defend your findings to someone who makes the following
statement: "You cannot trust that p-value because clearly the error term in the equation cannot have
a normal distribution."


APPENDIX 5A 


Asymptotic Normality of OLS 


We sketch a proof of the asymptotic normality of OLS [Theorem 5.2(i)] in the simple regression 
case. Write the simple regression model as in equation (5.16). Then, by the usual algebra of simple 
regression, we can write 


$$\sqrt{n}(\hat\beta_1 - \beta_1) = (1/s_x^2)\Big[n^{-1/2}\sum_{i=1}^{n}(x_i - \bar{x})u_i\Big],$$

where we use $s_x^2$ to denote the sample variance of $\{x_i: i = 1, 2, \ldots, n\}$. By the law of large 
numbers (see Math Refresher C), $s_x^2 \xrightarrow{p} \sigma_x^2 = \mathrm{Var}(x)$. Assumption MLR.3 rules out perfect collinearity, 
which means that $\mathrm{Var}(x) > 0$ ($x_i$ varies in the sample, and therefore x is not constant in the 
population). Next, 

$$n^{-1/2}\sum_{i=1}^{n}(x_i - \bar{x})u_i = n^{-1/2}\sum_{i=1}^{n}(x_i - \mu)u_i + (\mu - \bar{x})\Big[n^{-1/2}\sum_{i=1}^{n}u_i\Big],$$

where $\mu = E(x)$ is the population mean of x. Now $\{u_i\}$ is a sequence of i.i.d. random variables with mean zero and 
variance $\sigma^2$, and so $n^{-1/2}\sum_{i=1}^{n}u_i$ converges to the Normal$(0, \sigma^2)$ distribution as $n \to \infty$; this is just 
the central limit theorem from Math Refresher C. By the law of large numbers, $\mathrm{plim}(\mu - \bar{x}) = 0$. 
A standard result in asymptotic theory is that if $\mathrm{plim}(w_n) = 0$ and $z_n$ has an asymptotic normal 
distribution, then $\mathrm{plim}(w_n z_n) = 0$. [See Wooldridge (2010, Chapter 3) for more discussion.] This 
implies that $(\mu - \bar{x})[n^{-1/2}\sum_{i=1}^{n}u_i]$ has zero plim. Next, $\{(x_i - \mu)u_i: i = 1, 2, \ldots\}$ is an 
indefinite sequence of i.i.d. random variables with mean zero (because u and x are uncorrelated under 
Assumption MLR.4) and variance $\sigma^2\sigma_x^2$ by the homoskedasticity Assumption MLR.5. Therefore, 
$n^{-1/2}\sum_{i=1}^{n}(x_i - \mu)u_i$ has an asymptotic Normal$(0, \sigma^2\sigma_x^2)$ distribution. We just showed that the 
difference between $n^{-1/2}\sum_{i=1}^{n}(x_i - \bar{x})u_i$ and $n^{-1/2}\sum_{i=1}^{n}(x_i - \mu)u_i$ has zero plim. A result in 
asymptotic theory is that if $z_n$ has an asymptotic normal distribution and $\mathrm{plim}(v_n - z_n) = 0$, then $v_n$ has the 
same asymptotic normal distribution as $z_n$. It follows that $n^{-1/2}\sum_{i=1}^{n}(x_i - \bar{x})u_i$ also has an asymptotic 
Normal$(0, \sigma^2\sigma_x^2)$ distribution. Putting all of the pieces together gives 

$$\sqrt{n}(\hat\beta_1 - \beta_1) = (1/\sigma_x^2)\Big[n^{-1/2}\sum_{i=1}^{n}(x_i - \bar{x})u_i\Big] + [(1/s_x^2) - (1/\sigma_x^2)]\Big[n^{-1/2}\sum_{i=1}^{n}(x_i - \bar{x})u_i\Big],$$

and because $\mathrm{plim}(1/s_x^2) = 1/\sigma_x^2$, the second term has zero plim. Therefore, the asymptotic 
distribution of $\sqrt{n}(\hat\beta_1 - \beta_1)$ is Normal$(0, \{\sigma^2\sigma_x^2\}/\{\sigma_x^2\}^2)$ = Normal$(0, \sigma^2/\sigma_x^2)$. This completes the proof 
in the simple regression case, as $a_1^2 = \sigma_x^2$ in this case. See Wooldridge (2010, Chapter 4) for the 
general case. 
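
A small Monte Carlo experiment can make the conclusion of the proof concrete. The sketch below is only an illustration under assumed settings: the sample size, the number of replications, the uniform distribution for x, and the centered exponential distribution for u are all hypothetical choices, not taken from the text. The errors are deliberately skewed, so MLR.6 fails, yet the simulated variance of $\sqrt{n}(\hat\beta_1 - \beta_1)$ should be close to $\sigma^2/\sigma_x^2$.

```python
import random
import statistics

def slope_estimate(x, u, beta0=1.0, beta1=0.5):
    """OLS slope from the simple regression of y = beta0 + beta1*x + u on x."""
    y = [beta0 + beta1 * xi + ui for xi, ui in zip(x, u)]
    xbar = statistics.fmean(x)
    ybar = statistics.fmean(y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return sxy / sxx

def simulate(n=200, reps=1000, beta1=0.5, seed=12345):
    """Return reps draws of sqrt(n) * (beta1_hat - beta1)."""
    rng = random.Random(seed)
    draws = []
    for _ in range(reps):
        # x ~ Uniform(0, 4), so Var(x) = 16/12 = 4/3 (assumed design).
        x = [rng.uniform(0.0, 4.0) for _ in range(n)]
        # u = Exponential(1) - 1: mean 0, variance 1, clearly non-normal.
        u = [rng.expovariate(1.0) - 1.0 for _ in range(n)]
        draws.append(n ** 0.5 * (slope_estimate(x, u, beta1=beta1) - beta1))
    return draws

draws = simulate()
sim_var = statistics.pvariance(draws)
theory_var = 1.0 / (4.0 / 3.0)  # sigma^2 / sigma_x^2 under the assumed design
print("simulated mean:", statistics.fmean(draws))
print("simulated variance:", sim_var, " asymptotic variance:", theory_var)
```

Because the experiment is random, the simulated variance matches the asymptotic value only approximately; raising n and the number of replications should tighten the agreement.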


CHAPTER 6 


Multiple Regression 
Analysis: Further Issues 


This chapter brings together several issues in multiple regression analysis that we could 
not conveniently cover in earlier chapters. These topics are not as fundamental as the material 
in Chapters 3 and 4, but they are important for applying multiple regression to a broad range 
of empirical problems. 


6-1 Effects of Data Scaling on OLS Statistics 


In Chapter 2 on bivariate regression, we briefly discussed the effects of changing the units of mea- 
surement on the OLS intercept and slope estimates. We also showed that changing the units of mea- 
surement did not affect R-squared. We now return to the issue of data scaling and examine the effects 
of rescaling the dependent or independent variables on standard errors, t statistics, F statistics, and 
confidence intervals. 

We will discover that everything we expect to happen, does happen. When variables are rescaled, 
the coefficients, standard errors, confidence intervals, t statistics, and F statistics change in ways that 
preserve all measured effects and testing outcomes. Although this is no great surprise—in fact, we 
would be very worried if it were not the case—it is useful to see what occurs explicitly. Often, data 
scaling is used for cosmetic purposes, such as to reduce the number of zeros after a decimal point in 
an estimated coefficient. By judiciously choosing units of measurement, we can improve the appear- 
ance of an estimated equation while changing nothing that is essential. 

We could treat this problem in a general way, but it is much better illustrated with examples. 
Likewise, there is little value here in introducing an abstract notation. 
We begin with an equation relating infant birth weight to cigarette smoking and family income: 

$$\widehat{\mathit{bwght}} = \hat\beta_0 + \hat\beta_1\,\mathit{cigs} + \hat\beta_2\,\mathit{faminc}, \tag{6.1}$$
where 


bwght = child birth weight, in ounces. 
cigs = number of cigarettes smoked by the mother while pregnant, per day. 
faminc = annual family income, in thousands of dollars. 


The estimates of this equation, obtained using the data in BWGHT, are given in the first column of 
Table 6.1. Standard errors are listed in parentheses. The estimate on cigs says that if a woman smoked 
five more cigarettes per day, birth weight is predicted to be about .4634(5) = 2.317 ounces less. The 
t statistic on cigs is $-5.06$, so the variable is very statistically significant. 

Now, suppose that we decide to measure birth weight in pounds, rather than in ounces. Let 
bwghtlbs = bwght/16 be birth weight in pounds. What happens to our OLS statistics if we use this 
as the dependent variable in our equation? It is easy to find the effect on the coefficient estimates by 
simple manipulation of equation (6.1). Divide this entire equation by 16: 


$$\widehat{\mathit{bwght}}/16 = \hat\beta_0/16 + (\hat\beta_1/16)\,\mathit{cigs} + (\hat\beta_2/16)\,\mathit{faminc}.$$


Because the left-hand side is birth weight in pounds, it follows that each new coefficient will be the 
corresponding old coefficient divided by 16. To verify this, the regression of bwghtlbs on cigs and 
faminc is reported in column (2) of Table 6.1. Up to the reported digits (and any digits beyond), 
the intercept and slopes in column (2) are just those in column (1) divided by 16. For example, the 
coefficient on cigs is now $-.0289$; this means that if cigs were higher by five, birth weight would be 
.0289(5) = .1445 pounds lower. In terms of ounces, we have .1445(16) = 2.312, which is slightly 
different from the 2.317 we obtained earlier due to rounding error. The point is, after the effects are 
transformed into the same units, we get exactly the same answer, regardless of how the dependent 
variable is measured. 

What about statistical significance? As we expect, changing the dependent variable from ounces 
to pounds has no effect on how statistically important the independent variables are. The standard 
errors in column (2) are 16 times smaller than those in column (1). A few quick calculations show 


TABLE 6.1 Effects of Data Scaling 

Dependent Variable        (1) bwght      (2) bwghtlbs     (3) bwght 
Independent Variables 

cigs                      -.4634         -.0289           -- 
                          (.0916)        (.0057) 

packs                     --             --               -9.268 
                                                          (1.832) 

faminc                    .0927          .0058            .0927 
                          (.0292)        (.0018)          (.0292) 

intercept                 116.974        7.3109           116.974 
                          (1.049)        (.0656)          (1.049) 

Observations              1,388          1,388            1,388 
R-Squared                 .0298          .0298            .0298 
SSR                       557,485.51     2,177.6778       557,485.51 
SER                       20.063         1.2539           20.063 

that the t statistics in column (2) are indeed identical to the t statistics in column (1). The endpoints 
for the confidence intervals in column (2) are just the endpoints in column (1) divided by 16. This 
is because the CIs change by the same factor as the standard errors. [Remember that the 95% CI here 
is $\hat\beta_j \pm 1.96\,\mathrm{se}(\hat\beta_j)$.] 

In terms of goodness-of-fit, the R-squareds from the two regressions are identical, as should be 
the case. Notice that the sum of squared residuals (SSR) and the standard error of the regression 
(SER) do differ across equations. These differences are easily explained. Let $\hat{u}_i$ denote the residual for 
observation i in the original equation (6.1). Then the residual when bwghtlbs is the dependent variable 
is simply $\hat{u}_i/16$. Thus, the squared residual in the second equation is $(\hat{u}_i/16)^2 = \hat{u}_i^2/256$. This is why 
the SSR in column (2) is equal to the SSR in column (1) divided by 256. 

Because $\mathrm{SER} = \hat\sigma = \sqrt{\mathrm{SSR}/(n - k - 1)} = \sqrt{\mathrm{SSR}/1{,}385}$, the SER in column (2) is 16 times 
smaller than that in column (1). Another way to think about this is that the error in the equation with 
bwghtlbs as the dependent variable has a standard deviation 16 times smaller than the standard devia- 
tion of the original error. This does not mean that we have reduced the error by changing how birth 
weight is measured; the smaller SER simply reflects a difference in units of measurement. 
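
The scaling results for the dependent variable can be checked numerically. The following is a minimal sketch on synthetic data, since the BWGHT file is not reproduced here: the variable names mirror the text, but the generated values, the coefficients used to generate them, and the helper `ols` are all hypothetical. It regresses the outcome in "ounces" and in "pounds" on the same regressors and compares the statistics.

```python
import numpy as np

def ols(y, X):
    """OLS of y on X (X already contains a column of ones).

    Returns coefficients, standard errors, SSR, SER, and R-squared.
    """
    n, p = X.shape                      # p = k + 1 estimated parameters
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta
    ssr = float(resid @ resid)
    ser = np.sqrt(ssr / (n - p))
    se = ser * np.sqrt(np.diag(xtx_inv))
    sst = float(((y - y.mean()) ** 2).sum())
    return beta, se, ssr, ser, 1.0 - ssr / sst

# Synthetic stand-in for BWGHT (hypothetical numbers, not the real data).
rng = np.random.default_rng(0)
n = 1388
cigs = rng.poisson(2.0, size=n).astype(float)
faminc = rng.uniform(0.5, 65.0, size=n)
bwght = 117.0 - 0.46 * cigs + 0.09 * faminc + rng.normal(0.0, 20.0, size=n)

X = np.column_stack([np.ones(n), cigs, faminc])
b_oz, se_oz, ssr_oz, ser_oz, r2_oz = ols(bwght, X)          # ounces
b_lb, se_lb, ssr_lb, ser_lb, r2_lb = ols(bwght / 16.0, X)   # pounds

print("coefficient ratios (oz / lb):", b_oz / b_lb)
print("standard error ratios:       ", se_oz / se_lb)
print("t statistics (oz):", b_oz / se_oz)
print("t statistics (lb):", b_lb / se_lb)
print("SSR ratio:", ssr_oz / ssr_lb, " SER ratio:", ser_oz / ser_lb)
print("R-squared (oz, lb):", r2_oz, r2_lb)
```

Up to floating-point error, the coefficient and standard error ratios should all equal 16, the SSR ratio 256, and the t statistics and R-squareds should coincide, which is the pattern in columns (1) and (2) of Table 6.1.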

Next, let us return the dependent variable to its original units: bwght is measured in ounces. 
Instead, let us change the unit of measurement of one of the independent variables, cigs. Define packs 
to be the number of packs of cigarettes smoked per day. Thus, packs = cigs/20. What happens to the 
coefficients and other OLS statistics now? Well, we can write 


$$\widehat{\mathit{bwght}} = \hat\beta_0 + (20\hat\beta_1)(\mathit{cigs}/20) + \hat\beta_2\,\mathit{faminc} = \hat\beta_0 + (20\hat\beta_1)\,\mathit{packs} + \hat\beta_2\,\mathit{faminc}.$$


Thus, the intercept and slope coefficient on faminc are unchanged, but the coefficient on packs is 
20 times that on cigs. This is intuitively appealing. The results from the regression of bwght on 
packs and faminc are in column (3) of Table 6.1. 


GOING FURTHER 6.1 

In the original birth weight equation (6.1), suppose that faminc is measured in dollars rather than in 
thousands of dollars. Thus, define the variable fincdol = 1,000·faminc. How will the OLS statistics 
change when fincdol is substituted for faminc? For the purpose of presenting the regression results, 
do you think it is better to measure income in dollars or in thousands of dollars? 

Incidentally, remember that it would make no sense to include both cigs and packs in the same 
equation; this would induce perfect multicollinearity and would have no interesting meaning. 

Other than the coefficient on packs, there is one other statistic in column (3) that differs from that in 
column (1): the standard error on packs is 20 times larger than that on cigs in column (1). This means 
that the t statistic for testing the significance of cigarette smoking is the same whether we measure 
smoking in terms of cigarettes or packs. This is only natural. 

The previous example spells out most of the possibilities that arise when the dependent and inde- 
pendent variables are rescaled. Rescaling is often done with dollar amounts in economics, especially 
when the dollar amounts are very large. 
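
Rescaling an independent variable can be checked the same way. The sketch below again uses synthetic data with hypothetical values (only the variable names come from the text) and a small helper `ols` written for this illustration. It compares the regression on cigs with the regression on packs = cigs/20 and also confirms that a design matrix containing both variables is rank deficient.

```python
import numpy as np

def ols(y, X):
    """Return OLS coefficients and standard errors (X includes a column of ones)."""
    n, p = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = float(resid @ resid) / (n - p)
    return beta, np.sqrt(sigma2 * np.diag(xtx_inv))

# Synthetic stand-in for BWGHT (hypothetical numbers, not the real data).
rng = np.random.default_rng(1)
n = 1000
cigs = rng.poisson(2.0, size=n).astype(float)
faminc = rng.uniform(0.5, 65.0, size=n)
bwght = 117.0 - 0.46 * cigs + 0.09 * faminc + rng.normal(0.0, 20.0, size=n)
packs = cigs / 20.0

ones = np.ones(n)
b_cigs, se_cigs = ols(bwght, np.column_stack([ones, cigs, faminc]))
b_packs, se_packs = ols(bwght, np.column_stack([ones, packs, faminc]))

print("slope on packs / slope on cigs:", b_packs[1] / b_cigs[1])
print("se on packs / se on cigs:      ", se_packs[1] / se_cigs[1])
print("t statistics (cigs): ", b_cigs / se_cigs)
print("t statistics (packs):", b_packs / se_packs)

# Including both cigs and packs gives four columns but only rank three.
rank_both = np.linalg.matrix_rank(np.column_stack([ones, cigs, packs, faminc]))
print("rank with cigs and packs together:", rank_both)
```

The slope and standard error on packs should both be 20 times their counterparts for cigs, leaving every t statistic unchanged; the intercept and the faminc coefficient should be the same in both regressions.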

In Chapter 2, we argued that, if the dependent variable appears in logarithmic form, changing the 
unit of measurement does not affect the slope coefficient. The same is true here: changing the unit 
of measurement of the dependent variable, when it appears in logarithmic form, does not affect any 
of the slope estimates. This follows from the simple fact that $\log(c_1 y_i) = \log(c_1) + \log(y_i)$ for any 
constant $c_1 > 0$. The new intercept will be $\log(c_1) + \hat\beta_0$. Similarly, changing the unit of measurement 
of any $x_j$, where $\log(x_j)$ appears in the regression, only affects the intercept. This corresponds to what 
we know about percentage changes and, in particular, elasticities: they are invariant to the units of 
measurement of either y or the $x_j$. For example, if we had specified the dependent variable in (6.1) to 
be log(bwght), estimated the equation, and then reestimated it with log(bwghtlbs) as the dependent 
variable, the coefficients on cigs and faminc would be the same in both regressions; only the intercept 
would be different. 
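
A quick numerical check of the logarithmic case, again as a sketch on synthetic data: the generated values and the coefficients behind them are hypothetical, and only the variable names follow the text. Here $c_1 = 1/16$, so the intercept should shift by $\log(1/16)$ while the slopes stay put.

```python
import numpy as np

# Synthetic, strictly positive outcome in "ounces" (hypothetical, not BWGHT).
rng = np.random.default_rng(2)
n = 500
cigs = rng.poisson(2.0, size=n).astype(float)
faminc = rng.uniform(0.5, 65.0, size=n)
bwght = np.exp(4.7 - 0.004 * cigs + 0.001 * faminc + rng.normal(0.0, 0.15, size=n))

X = np.column_stack([np.ones(n), cigs, faminc])
b_oz, *_ = np.linalg.lstsq(X, np.log(bwght), rcond=None)          # log(bwght)
b_lb, *_ = np.linalg.lstsq(X, np.log(bwght / 16.0), rcond=None)   # log(bwghtlbs)

print("slopes with log(bwght):   ", b_oz[1:])
print("slopes with log(bwghtlbs):", b_lb[1:])
print("intercept shift:", b_lb[0] - b_oz[0], " log(1/16):", np.log(1.0 / 16.0))
```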
6-1a Beta Coefficients 


Sometimes, in econometric applications, a key variable is measured on a scale that is difficult to inter- 
pret. Labor economists often include test scores in wage equations, and the scale on which these tests 
are scored is often arbitrary and not easy to interpret (at least for economists!). In almost all cases, 
we are interested in how a particular individual’s score compares with the population. Thus, instead 
of asking about the effect on hourly wage if, say, a test score is 10 points higher, it makes more sense 
to ask what happens when the test score is one standard deviation higher. 

Nothing prevents us from seeing what happens to the dependent variable when an independent 
variable in an estimated model increases by a certain number of standard deviations, assuming that 
we have obtained the sample standard deviation of the independent variable (which is easy in most 
regression packages). This is often a good idea. So, for example, when we look at the effect of a stan- 
dardized test score, such as the SAT score, on college GPA, we can find the standard deviation of SAT 
and see what happens when the SAT score increases by one or two standard deviations. 

Sometimes, it is useful to obtain regression results when all variables involved, the dependent as 
well as all the independent variables, have been standardized. A variable is standardized in the sample by 
subtracting off its mean and dividing by its standard deviation (see Math Refresher C). This means that 
we compute the z-score for every variable in the sample. Then, we run a regression using the z-scores. 

Why is standardization useful? It is easiest to start with the original OLS equation, with the vari- 
ables in their original forms: 


$$y_i = \hat\beta_0 + \hat\beta_1 x_{i1} + \hat\beta_2 x_{i2} + \cdots + \hat\beta_k x_{ik} + \hat{u}_i. \tag{6.2}$$


We have included the observation subscript i to emphasize that our standardization is applied to all 
sample values. Now, if we average (6.2), use the fact that the $\hat{u}_i$ have a zero sample average, and 
subtract the result from (6.2), we get 


$$y_i - \bar{y} = \hat\beta_1(x_{i1} - \bar{x}_1) + \hat\beta_2(x_{i2} - \bar{x}_2) + \cdots + \hat\beta_k(x_{ik} - \bar{x}_k) + \hat{u}_i.$$


Now, let $\hat\sigma_y$ be the sample standard deviation for the dependent variable, let $\hat\sigma_1$ be the sample sd for $x_1$, 
let $\hat\sigma_2$ be the sample sd for $x_2$, and so on. Then, simple algebra gives the equation 


$$(y_i - \bar{y})/\hat\sigma_y = (\hat\sigma_1/\hat\sigma_y)\hat\beta_1[(x_{i1} - \bar{x}_1)/\hat\sigma_1] + \cdots + (\hat\sigma_k/\hat\sigma_y)\hat\beta_k[(x_{ik} - \bar{x}_k)/\hat\sigma_k] + (\hat{u}_i/\hat\sigma_y). \tag{6.3}$$


Each variable in (6.3) has been standardized by replacing it with its z-score, and this has resulted in 
new slope coefficients. For example, the slope coefficient on $(x_{i1} - \bar{x}_1)/\hat\sigma_1$ is $(\hat\sigma_1/\hat\sigma_y)\hat\beta_1$. This is 
simply the original coefficient, $\hat\beta_1$, multiplied by the ratio of the standard deviation of $x_1$ to the standard 
deviation of y. The intercept has dropped out altogether. 

It is useful to rewrite (6.3), dropping the i subscript, as 


$$z_y = \hat{b}_1 z_1 + \hat{b}_2 z_2 + \cdots + \hat{b}_k z_k + \mathit{error}, \tag{6.4}$$

where $z_y$ denotes the z-score of y, $z_1$ is the z-score of $x_1$, and so on. The new coefficients are 

$$\hat{b}_j = (\hat\sigma_j/\hat\sigma_y)\hat\beta_j \quad \text{for } j = 1, \ldots, k. \tag{6.5}$$

These $\hat{b}_j$ are traditionally called standardized coefficients or beta coefficients. (The latter name is 
more common, which is unfortunate because we have been using beta hat to denote the usual OLS 
estimates.) 
Beta coefficients receive their interesting meaning from equation (6.4): if $x_j$ increases by one 
standard deviation, then $\hat{y}$ changes by $\hat{b}_j$ standard deviations. Thus, we are measuring effects not in 
terms of the original units of y or the $x_j$, but in standard deviation units. Because it makes the scale of 
the regressors irrelevant, this equation puts the explanatory variables on equal footing. In a standard 
OLS equation, it is not possible to simply look at the size of different coefficients and conclude that 
the explanatory variable with the largest coefficient is “the most important.” We just saw that the 
magnitudes of coefficients can be changed at will by changing the units of measurement of the $x_j$. 
But, when each $x_j$ has been standardized, comparing the magnitudes of the resulting beta coefficients 
is more compelling. When the regression equation has only a single explanatory variable, $x_1$, its 
standardized coefficient is simply the sample correlation coefficient between y and $x_1$, which means it 
must lie in the range $-1$ to 1. 

Even in situations in which the coefficients are easily interpretable—say, the dependent variable 
and independent variables of interest are in logarithmic form, so the OLS coefficients of interest are 
estimated elasticities—there is still room for computing beta coefficients. Although elasticities are 
free of units of measurement, a change in a particular explanatory variable by, say, 10% may repre- 
sent a larger or smaller change over a variable’s range than changing another explanatory variable by 
10%. For example, in a state with wide income variation but relatively little variation in spending per 
student, it might not make much sense to compare performance elasticities with respect to the income 
and spending. Comparing beta coefficient magnitudes can be helpful. 

To obtain the beta coefficients, we can always standardize $y, x_1, \ldots, x_k$ and then run the OLS 
regression of the z-score of y on the z-scores of $x_1, \ldots, x_k$, where it is not necessary to include an 
intercept, as it will be zero. This can be tedious with many independent variables. Many regression 
packages provide beta coefficients via a simple command. The following example illustrates the use 
of beta coefficients. 
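
The two routes to the beta coefficients can be compared directly. The sketch below is an illustration on synthetic data: the regressors `x1` and `x2`, their scales, and the coefficients used to generate `y` are hypothetical choices, and `zscore` is a helper written for this example. It computes the beta coefficients once by regressing z-scores on z-scores and once by rescaling the usual OLS slopes as in (6.5), and it checks the single-regressor case against the sample correlation.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400
# Hypothetical regressors measured on very different scales.
x1 = rng.normal(50.0, 10.0, size=n)
x2 = rng.normal(0.0, 0.2, size=n) + 0.005 * x1
y = 2.0 + 0.3 * x1 - 8.0 * x2 + rng.normal(0.0, 4.0, size=n)

def zscore(v):
    """Standardize in the sample: subtract the mean, divide by the sample sd."""
    return (v - v.mean()) / v.std(ddof=1)

# Route 1: regress the z-score of y on the z-scores of x1 and x2 (no intercept).
Z = np.column_stack([zscore(x1), zscore(x2)])
b_from_z, *_ = np.linalg.lstsq(Z, zscore(y), rcond=None)

# Route 2: rescale the usual OLS slopes, b_j = (sd_j / sd_y) * beta_j.
X = np.column_stack([np.ones(n), x1, x2])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
sd_x = np.array([x1.std(ddof=1), x2.std(ddof=1)])
b_from_formula = sd_x / y.std(ddof=1) * beta_hat[1:]

# Single regressor: the beta coefficient equals the sample correlation.
b_single, *_ = np.linalg.lstsq(zscore(x1)[:, None], zscore(y), rcond=None)
corr = np.corrcoef(x1, y)[0, 1]

print("beta coefficients from z-scores:", b_from_z)
print("beta coefficients from (6.5):   ", b_from_formula)
print("single-regressor beta:", b_single[0], " correlation:", corr)
```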


Effects of Pollution on Housing Prices 


We use the data from Example 4.5 (in the file HPRICE2) to illustrate the use of beta coefficients. 
Recall that the key independent variable is nox, a measure of the nitrogen oxide in the air over each 
community. One way to understand the size of the pollution effect—without getting into the science 
underlying nitrogen oxide’s effect on air quality—is to compute beta coefficients. (An alternative 
approach is contained in Example 4.5: we obtained a price elasticity with respect to nox by using 
price and nox in logarithmic form.) 

The population equation is the level-level model 


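$$\mathit{price} = \beta_0 + \beta_1\,\mathit{nox} + \beta_2\,\mathit{crime} + \beta_3\,\mathit{rooms} + \beta_4\,\mathit{dist} + \beta_5\,\mathit{stratio} + u,$$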


where all the variables except crime were defined in Example 4.5; crime is the number of reported 
crimes per capita. The beta coefficients are reported in the following equation (so each variable has 
been converted to its z-score): 


$$\widehat{\mathit{zprice}} = -.340\,\mathit{znox} - .143\,\mathit{zcrime} + .514\,\mathit{zrooms} - .235\,\mathit{zdist} - .270\,\mathit{zstratio}.$$


This equation shows that a one standard deviation increase in nox decreases price by .34 standard 
deviation; a one standard deviation increase in crime reduces price by .14 standard deviation. Thus, 
the same relative movement of pollution in the population has a larger effect on housing prices than 
crime does. Size of the house, as measured by number of rooms (rooms), has the largest standard- 
ized effect. If we want to know the effects of each independent variable on the dollar value of median 
house price, we should use the unstandardized variables. 

Whether we use standardized or unstandardized variables does not affect statistical significance: 
the t statistics are the same in both cases. 
6-2 More on Functional Form 


In several previous examples, we have encountered the most popular device in econometrics for 
allowing nonlinear relationships between the explained and explanatory variables: using logarithms 
for the dependent or independent variables. We have also seen models containing quadratics in some 
explanatory variables, but we have yet to provide a systematic treatment of them. In this section, we 
cover some variations and extensions on functional forms that often arise in applied work. 


6-2a More on Using Logarithmic Functional Forms 


We begin by reviewing how to interpret the parameters in the model 
$$\log(\mathit{price}) = \beta_0 + \beta_1\log(\mathit{nox}) + \beta_2\,\mathit{rooms} + u, \tag{6.6}$$


where these variables are taken from Example 4.5. Recall that throughout the text log(x) is the natural 
log of x. The coefficient $\beta_1$ is the elasticity of price with respect to nox (pollution). The coefficient $\beta_2$ 
is the change in log(price) when $\Delta\mathit{rooms} = 1$; as we have seen many times, when multiplied by 100, 
this is the approximate percentage change in price. Recall that $100\cdot\beta_2$ is sometimes called the 
semi-elasticity of price with respect to rooms. 

When estimated using the data in HPRICE2, we obtain 


$$\widehat{\log(\mathit{price})} = \underset{(0.19)}{9.23} - \underset{(.066)}{.718}\,\log(\mathit{nox}) + \underset{(.019)}{.306}\,\mathit{rooms} \tag{6.7}$$
$$n = 506,\quad R^2 = .514.$$


Thus, when nox increases by 1%, price falls by .718%, holding only rooms fixed. When rooms 
increases by one, price increases by approximately 100(.306) = 30.6%. 

The estimate that one more room increases price by about 30.6% turns out to be somewhat 
inaccurate for this application. The approximation error occurs because, as the change in log(y) 
becomes larger and larger, the approximation $\%\Delta y \approx 100\cdot\Delta\log(y)$ becomes more and more 
inaccurate. Fortunately, a simple calculation is available to compute the exact percentage change. 

To describe the procedure, we consider the general estimated model 


$$\widehat{\log(y)} = \hat\beta_0 + \hat\beta_1\log(x_1) + \hat\beta_2 x_2.$$


(Adding additional independent variables does not change the procedure.) Now, fixing $x_1$, we have 
$\Delta\widehat{\log(y)} = \hat\beta_2\Delta x_2$. Using simple algebraic properties of the exponential and logarithmic functions 
gives the exact percentage change in the predicted y as 


$$\%\Delta\hat{y} = 100\cdot[\exp(\hat\beta_2\Delta x_2) - 1], \tag{6.8}$$


where the multiplication by 100 turns the proportionate change into a percentage change. When 
$\Delta x_2 = 1$, 


$$\%\Delta\hat{y} = 100\cdot[\exp(\hat\beta_2) - 1]. \tag{6.9}$$


Applied to the housing price example with $x_2 = \mathit{rooms}$ and $\hat\beta_2 = .306$, $\%\Delta\widehat{\mathit{price}} = 100[\exp(.306) - 1] = 35.8\%$, 
which is notably larger than the approximate percentage change, 30.6%, obtained 
directly from (6.7). [Incidentally, this is not an unbiased estimator because exp(·) is a nonlinear 
function; it is, however, a consistent estimator of $100[\exp(\beta_2) - 1]$. This is because the probability 
limit passes through continuous functions, while the expected value operator does not. See Math 
Refresher C.] 
The adjustment in equation (6.8) is not as crucial for small percentage changes. For example, 
when we include the student-teacher ratio in equation (6.7), its estimated coefficient is $-.052$, which 
means that if stratio increases by one, price decreases by approximately 5.2%. The exact 
proportionate change is $\exp(-.052) - 1 \approx -.051$, or $-5.1\%$. On the other hand, if we increase stratio by five, 
then the approximate percentage change in price is $-26\%$, while the exact change obtained from 
equation (6.8) is $100[\exp(-.26) - 1] \approx -22.9\%$. 

The logarithmic approximation to percentage changes has an advantage that justifies its reporting 
even when the percentage change is large. To describe this advantage, consider again the effect on 
price of changing the number of rooms by one. The logarithmic approximation is just the coefficient 
on rooms in equation (6.7) multiplied by 100, namely, 30.6%. We also computed an estimate of the 
exact percentage change for increasing the number of rooms by one as 35.8%. But what if we want to 
estimate the percentage change for decreasing the number of rooms by one? In equation (6.8) we take 
$\Delta x_2 = -1$ and $\hat\beta_2 = .306$, and so $\%\Delta\widehat{\mathit{price}} = 100[\exp(-.306) - 1] = -26.4$, or a drop of 26.4%. 
Notice that the approximation based on using the coefficient on rooms is between 26.4 and 35.8—an 
outcome that always occurs. In other words, simply using the coefficient (multiplied by 100) gives us 
an estimate that is always between the absolute value of the estimates for an increase and a decrease. 
If we are specifically interested in an increase or a decrease, we can use the calculation based on 
equation (6.8). 
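
The calculations in this subsection are easy to reproduce. The sketch below implements equation (6.8) next to the logarithmic approximation; the function names `approx_pct_change` and `exact_pct_change` are labels chosen for this illustration, and the coefficients are the estimates quoted in the text.

```python
import math

def approx_pct_change(beta, dx=1.0):
    """Logarithmic approximation: 100 * beta * dx."""
    return 100.0 * beta * dx

def exact_pct_change(beta, dx=1.0):
    """Exact percentage change in predicted y from equation (6.8)."""
    return 100.0 * (math.exp(beta * dx) - 1.0)

# rooms coefficient from (6.7): one more room, then one fewer room.
print(round(approx_pct_change(0.306), 1))        # 30.6
print(round(exact_pct_change(0.306), 1))         # 35.8
print(round(exact_pct_change(0.306, -1.0), 1))   # -26.4

# stratio coefficient: a one-unit and a five-unit increase.
print(round(exact_pct_change(-0.052), 1))        # -5.1
print(round(exact_pct_change(-0.052, 5.0), 1))   # -22.9
```

The approximate figure of 30.6 sits between the magnitudes of the exact decrease (26.4) and the exact increase (35.8), as the text describes.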

The point just made about computing percentage changes is essentially the one made in introduc- 
tory economics when it comes to computing, say, price elasticities of demand based on large price 
changes: the result depends on whether we use the beginning or ending price and quantity in comput- 
ing the percentage changes. Using the logarithmic approximation is similar in spirit to calculating an 
arc elasticity of demand, where the averages of prices and quantities are used in the denominators in 
computing the percentage changes. 

We have seen that using natural logs leads to coefficients with appealing interpretations, and we 
can be ignorant about the units of measurement of variables appearing in logarithmic form because 
the slope coefficients are invariant to rescalings. There are several other reasons logs are used so much 
in applied work. First, when y > 0, models using log(y) as the dependent variable often satisfy the 
CLM assumptions more closely than models using the level of y. Strictly positive variables often have 
conditional distributions that are heteroskedastic or skewed; taking the log can mitigate, if not elimi- 
nate, both problems. 

Another potential benefit of using logs is that taking the log of a variable often narrows its range. 
This is particularly true of variables that can be large monetary values, such as firms’ annual sales or 
baseball players’ salaries. Population variables also tend to vary widely. Narrowing the range of the 
dependent and independent variables can make OLS estimates less sensitive to outlying (or extreme) 
values; we take up the issue of outlying observations in Chapter 9. 

However, one must not indiscriminately use the logarithmic transformation because in some cases 
it can actually create extreme values. An example is when a variable y is between zero and one (such 
as a proportion) and takes on values close to zero. In this case, log(y) (which is necessarily negative) 
can be very large in magnitude whereas the original variable, y, is bounded between zero and one. 

There are some standard rules of thumb for taking logs, although none is written in stone. When 
a variable is a positive dollar amount, the log is often taken. We have seen this for variables such 
as wages, salaries, firm sales, and firm market value. Variables such as population, total number of 
employees, and school enrollment often appear in logarithmic form; these have the common feature 
of being large integer values. 

Variables that are measured in years—such as education, experience, tenure, age, and so on— 
usually appear in their original form. A variable that is a proportion or a percent—such as the 
unemployment rate, the participation rate in a pension plan, the percentage of students passing a 
standardized exam, and the arrest rate on reported crimes—can appear in either original or logarith- 
mic form, although there is a tendency to use them in level forms. This is because any regression 
coefficients involving the original variable (whether it is the dependent or independent variable) 
will have a percentage point change interpretation. (See Math Refresher A for a review of the dis- 
tinction between a percentage change and a percentage point change.) If we use, say, log(unem) in a 
regression, where unem is the percentage of unemployed individuals, we must be very careful to dis- 
tinguish between a percentage point change and a percentage change. Remember, if unem goes from 
8 to 9, this is an increase of one percentage point, but a 12.5% increase from the initial unemploy- 
ment level. Using the log means that we are looking at the percentage change in the unemployment 
rate: $\log(9) - \log(8) \approx .118$ or 11.8%, which is the logarithmic approximation to the actual 12.5% 
increase. 
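
The three quantities in the unemployment example can be computed side by side. This is only a small arithmetic sketch; the variable names are chosen for illustration.

```python
import math

old, new = 8.0, 9.0                        # unemployment rate, in percent
point_change = new - old                   # percentage point change
pct_change = 100.0 * (new - old) / old     # percentage change from the initial level
log_approx = 100.0 * (math.log(new) - math.log(old))  # logarithmic approximation

print(point_change, pct_change, round(log_approx, 1))  # 1.0 12.5 11.8
```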
GOING FURTHER 6.2 

Suppose that the annual number of drunk driving arrests is determined by 

$$\log(\mathit{arrests}) = \beta_0 + \beta_1\log(\mathit{pop}) + \beta_2\,\mathit{age16\_25} + \text{other factors},$$

where age16_25 is the proportion of the population between 16 and 25 years of age. Show that $\beta_2$ 
has the following (ceteris paribus) interpretation: it is the percentage change in arrests when the 
percentage of the people aged 16 to 25 increases by one percentage point. 

One limitation of the log is that it cannot be used if a variable takes on zero or negative values. In 
cases where a variable y is nonnegative but can take on the value 0, log(1 + y) is sometimes used. The 
percentage change interpretations are often closely preserved, except for changes beginning at y = 0 
(where the percentage change is not even defined). Generally, using log(1 + y) and then interpreting the 
estimates as if the variable were log(y) is acceptable when the data on y contain relatively few zeros. 
An example might be where y is hours of training per employee for the population of manufacturing 
firms, if a large fraction of firms provides training to at least one worker. Technically, however, 
log(1 + y) cannot be normally distributed (although it might be less heteroskedastic than y). Useful, 
albeit more advanced, alternatives are the Tobit and Poisson models in Chapter 17. 

One drawback to using a dependent variable in 
logarithmic form is that it is more difficult to predict 
the original variable. The original model allows us 
to predict log(y), not y. Nevertheless, it is fairly easy 
to turn a prediction for log(y) into a prediction for y (see Section 6-4). A related point is that it is not 
legitimate to compare R-squareds from models where y is the dependent variable in one case and 
log(y) is the dependent variable in the other. These measures explain variations in different variables. 
We discuss how to compute comparable goodness-of-fit measures in Section 6-4. 


6-2b Models with Quadratics 


Quadratic functions are also used quite often in applied economics to capture decreasing or increas- 
ing marginal effects. You may want to review properties of quadratic functions in Math Refresher A. 
In the simplest case, y depends on a single observed factor x, but it does so in a quadratic fashion: 


$$y = \beta_0 + \beta_1 x + \beta_2 x^2 + u.$$


For example, take y = wage and x = exper. As we discussed in Chapter 3, this model falls outside of 
simple regression analysis but is easily handled with multiple regression. 
It is important to remember that $\beta_1$ does not measure the change in y with respect to x; it makes 
no sense to hold $x^2$ fixed while changing x. If we write the estimated equation as 

$$\hat{y} = \hat\beta_0 + \hat\beta_1 x + \hat\beta_2 x^2, \tag{6.10}$$


then we have the approximation 


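$$\Delta\hat{y} \approx (\hat\beta_1 + 2\hat\beta_2 x)\Delta x, \quad \text{so} \quad \Delta\hat{y}/\Delta x \approx \hat\beta_1 + 2\hat\beta_2 x. \tag{6.11}$$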
This says that the slope of the relationship between x and y depends on the value of x; the estimated 
slope is $\hat\beta_1 + 2\hat\beta_2 x$. If we plug in x = 0, we see that $\hat\beta_1$ can be interpreted as the approximate slope in 
going from x = 0 to x = 1. After that, the second term, $2\hat\beta_2 x$, must be accounted for. 

If we are only interested in computing the predicted change in y given a starting value for x and a 
change in x, we could use (6.10) directly: there is no reason to use the calculus approximation at all. 
However, we are usually more interested in quickly summarizing the effect of x on y, and the 
interpretation of $\hat\beta_1$ and $\hat\beta_2$ in equation (6.11) provides that summary. Typically, we might plug in the average 
value of x in the sample, or some other interesting values, such as the median or the lower and upper 
quartile values. 

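In many applications, $\hat\beta_1$ is positive and $\hat\beta_2$ is negative. For example, using the wage data in 
WAGE1, we obtain 


$$\widehat{\mathit{wage}} = \underset{(.35)}{3.73} + \underset{(.041)}{.298}\,\mathit{exper} - \underset{(.0009)}{.0061}\,\mathit{exper}^2 \tag{6.12}$$
$$n = 526,\quad R^2 = .093.$$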


This estimated equation implies that exper has a diminishing effect on wage. The first year of 
experience is worth roughly 30¢ per hour ($.298). The second year of experience is worth less [about 
$.298 - 2(.0061)(1) \approx .286$, or 28.6¢, according to the approximation in (6.11) with x = 1]. In going 
from 10 to 11 years of experience, wage is predicted to increase by about $.298 - 2(.0061)(10) = .176$, 
or 17.6¢. And so on. 
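
The marginal effects just quoted come from the approximation in (6.11); the exact predicted change can also be obtained by differencing (6.10). The sketch below does both with the estimates in (6.12). The function names `wage_hat` and `approx_slope` are labels chosen for this illustration.

```python
def wage_hat(exper, b0=3.73, b1=0.298, b2=-0.0061):
    """Predicted wage from the estimated quadratic (6.12)."""
    return b0 + b1 * exper + b2 * exper ** 2

def approx_slope(exper, b1=0.298, b2=-0.0061):
    """Approximate effect of one more year at a given exper, from (6.11)."""
    return b1 + 2.0 * b2 * exper

for start in (0, 1, 10):
    exact = wage_hat(start + 1) - wage_hat(start)   # exact change in predicted wage
    print(start, round(approx_slope(start), 4), round(exact, 4))
```

The exact one-year change equals $\hat\beta_1 + \hat\beta_2(2x + 1)$, so it differs from the approximation by $\hat\beta_2$, here about six-tenths of a cent.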

When the coefficient on x is positive and the coefficient on $x^2$ is negative, the quadratic has a 
parabolic shape. There is always a positive value of x where the effect of x on y is zero; before this 
point, x has a positive effect on y; after this point, x has a negative effect on y. In practice, it can be 
important to know where this turning point is. 

In the estimated equation (6.10) with $\hat\beta_1 > 0$ and $\hat\beta_2 < 0$, the turning point (or maximum of 
the function) is always achieved at the coefficient on x over twice the absolute value of the coefficient 
on $x^2$: 


$$x^* = |\hat\beta_1/(2\hat\beta_2)|. \tag{6.13}$$


In the wage example, $x^* = \mathit{exper}^*$ is $.298/[2(.0061)] \approx 24.4$. (Note how we just drop the minus sign 
on $-.0061$ in doing this calculation.) This quadratic relationship is illustrated in Figure 6.1. 
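
The turning point can be verified numerically. The sketch below applies (6.13) to the estimates in (6.12) and checks that the estimated slope is zero there and that the predicted wage is lower one year on either side. The function names are labels chosen for this illustration, and the formula is used under the assumption that the two coefficients have opposite signs, as in the text.

```python
def turning_point(b1, b2):
    """x* = |b1 / (2*b2)|; assumes b1 and b2 have opposite signs."""
    return abs(b1 / (2.0 * b2))

def wage_hat(exper, b0=3.73, b1=0.298, b2=-0.0061):
    """Predicted wage from the estimated quadratic (6.12)."""
    return b0 + b1 * exper + b2 * exper ** 2

x_star = turning_point(0.298, -0.0061)
slope_at_star = 0.298 + 2.0 * (-0.0061) * x_star   # estimated slope at x*

print(round(x_star, 1))                             # 24.4
print(slope_at_star)                                # zero up to floating-point error
print(wage_hat(x_star - 1), wage_hat(x_star), wage_hat(x_star + 1))
```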

In the wage equation (6.12), the return to experience becomes zero at about 24.4 years. What 
should we make of this? There are at least three possible explanations. First, it may be that few 
people in the sample have more than 24 years of experience, and so the part of the curve to the 
right of 24 can be ignored. The cost of using a quadratic to capture diminishing effects is that 
the quadratic must eventually turn around. If this point is beyond all but a small percentage of 
the people in the sample, then this is not of much concern. But in the data set WAGE1, about 28% 
of the people in the sample have more than 24 years of experience; this is too high a percentage 
to ignore. 

It is possible that the return to exper really becomes negative at some point, but it is hard to 
believe that this happens at 24 years of experience. A more likely possibility is that the estimated 
effect of exper on wage is biased because we have controlled for no other factors, or because the 
functional relationship between wage and exper in equation (6.12) is not entirely correct. Computer 
Exercise C2 asks you to explore this possibility by controlling for education, in addition to using 
log(wage) as the dependent variable. 

When a model has a dependent variable in logarithmic form and an explanatory variable entering 
as a quadratic, some care is needed in reporting the partial effects. The following example also shows 
that the quadratic can have a U-shape, rather than a parabolic shape. A U-shape arises in equation 
(6.10) when $\hat{\beta}_1$ is negative and $\hat{\beta}_2$ is positive; this captures an increasing effect of x on y. 

FIGURE 6.1 Quadratic relationship between wage and exper. [Graph: wage on the vertical axis, exper on the horizontal axis; the turning point is marked at exper = 24.4.] 


EXAMPLE 6.2 Effects of Pollution on Housing Prices 


We modify the housing price model from Example 4.5 to include a quadratic term in rooms: 


$\log(price) = \beta_0 + \beta_1 \log(nox) + \beta_2 \log(dist) + \beta_3 rooms + \beta_4 rooms^2 + \beta_5 stratio + u.$ [6.14] 


The model estimated using the data in HPRICE2 is 

$\widehat{\log(price)} = \underset{(.57)}{13.39} - \underset{(.115)}{.902}\,\log(nox) - \underset{(.043)}{.087}\,\log(dist) - \underset{(.165)}{.545}\,rooms + \underset{(.013)}{.062}\,rooms^2 - \underset{(.006)}{.048}\,stratio$ 


$n = 506, R^2 = .603.$ 


The quadratic term $rooms^2$ has a t statistic of about 4.77, and so it is very statistically significant. But 
what about interpreting the effect of rooms on log(price)? Initially, the effect appears to be strange. 
Because the coefficient on rooms is negative and the coefficient on $rooms^2$ is positive, this equation 
literally implies that, at low values of rooms, an additional room has a negative effect on log(price). 
At some point, the effect becomes positive, and the quadratic shape means that the semi-elasticity of 
price with respect to rooms is increasing as rooms increases. This situation is shown in Figure 6.2. 

We obtain the turnaround value of rooms using equation (6.13) (even though $\hat{\beta}_1$ is negative and 
$\hat{\beta}_2$ is positive). The absolute value of the coefficient on rooms, .545, divided by twice the coefficient 
on $rooms^2$, .062, gives $rooms^* = .545/[2(.062)] \approx 4.4$; this point is labeled in Figure 6.2. 

FIGURE 6.2 log(price) as a quadratic function of rooms. 

Do we really believe that starting at three rooms and increasing to four rooms actually reduces a 
house’s expected value? Probably not. It turns out that only five of the 506 communities in the sample 
have houses averaging 4.4 rooms or less, about 1% of the sample. This is so small that the quadratic 
to the left of 4.4 can, for practical purposes, be ignored. To the right of 4.4, we see that adding another 
room has an increasing effect on the percentage change in price: 


$\Delta\widehat{\log(price)} \approx [-.545 + 2(.062)\,rooms]\,\Delta rooms$ 


and so 

$\%\Delta\widehat{price} \approx 100[-.545 + 2(.062)\,rooms]\,\Delta rooms = (-54.5 + 12.4\,rooms)\,\Delta rooms.$ 


Thus, an increase in rooms from, say, five to six increases price by about −54.5 + 12.4(5) = 7.5%; 
the increase from six to seven increases price by roughly −54.5 + 12.4(6) = 19.9%. This is a very 
strong increasing effect. 
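
The two percentages above can be reproduced with a short Python sketch. The function name `pct_effect_of_next_room` is a hypothetical helper for illustration; the coefficients are the estimates on rooms and its square reported for the HPRICE2 regression.

```python
# Estimates quoted in the text for the housing price equation.
B_ROOMS = -0.545      # coefficient on rooms
B_ROOMS_SQ = 0.062    # coefficient on rooms squared

def pct_effect_of_next_room(rooms):
    """Approximate % change in price from one more room, starting at `rooms`."""
    return 100.0 * (B_ROOMS + 2.0 * B_ROOMS_SQ * rooms)

effects = {r: round(pct_effect_of_next_room(r), 1) for r in (5, 6)}
print(effects)  # {5: 7.5, 6: 19.9}
```

Because the slope is linear in rooms, each extra room adds another 12.4 percentage points to the approximate effect.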

The strong increasing effect of rooms on log(price) in this example illustrates an important les- 
son: one cannot simply look at the coefficient on the quadratic term—in this case, .062—and declare 
that it is too small to bother with, based only on its magnitude. In many applications with quadratics, 
the coefficient on the squared variable has one or more zeros after the decimal point: after all, this 
coefficient measures how the slope is changing as x (rooms) changes. A seemingly small coefficient 
can have practically important consequences, as we just saw. As a general rule, one must compute the 
partial effect and see how it varies with x to determine if the quadratic term is practically important. In 
doing so, it is useful to compare the changing slope implied by the quadratic model with the constant 
slope obtained from the model with only a linear term. If we drop $rooms^2$ from the equation, the 
coefficient on rooms becomes about .255, which implies that each additional room, starting from any 
number of rooms, increases median price by about 25.5%. This is very different from the quadratic 
model, where the effect becomes 25.5% at rooms = 6.45 but changes rapidly as rooms gets smaller 
or larger. For example, at rooms = 7, the return to the next room is about 32.3%. 
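
The comparison between the constant slope and the changing slope can be checked numerically. This is a minimal sketch using only the estimates quoted in the text; the names `quadratic_slope` and `rooms_equal` are hypothetical.

```python
B_ROOMS = -0.545       # coefficient on rooms in the quadratic model
B_ROOMS_SQ = 0.062     # coefficient on rooms squared
LINEAR_SLOPE = 0.255   # coefficient on rooms when the squared term is dropped

def quadratic_slope(rooms):
    """Slope of log(price) with respect to rooms in the quadratic model."""
    return B_ROOMS + 2.0 * B_ROOMS_SQ * rooms

# Value of rooms at which the quadratic slope equals the constant linear slope.
rooms_equal = (LINEAR_SLOPE - B_ROOMS) / (2.0 * B_ROOMS_SQ)

print(round(rooms_equal, 2))               # 6.45
print(round(100 * quadratic_slope(7), 1))  # 32.3
```
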

What happens generally if the coefficients on the level and squared terms have the same sign 
(either both positive or both negative) and the explanatory variable is necessarily nonnegative (as in 
the case of rooms or exper)? In either case, there is no turning point for values x > 0. For example, 
if $\beta_1$ and $\beta_2$ are both positive, the smallest expected value of y is at x = 0, and increases in x always 
have a positive and increasing effect on y. (This is also true if $\beta_1 = 0$ and $\beta_2 > 0$, which means that 
the partial effect is zero at x = 0 and increasing as x increases.) Similarly, if $\beta_1$ and $\beta_2$ are both 
negative, the largest expected value of y is at x = 0, and increases in x have a negative effect on y, with the 
magnitude of the effect increasing as x gets larger. 

The general formula for the turning point of any quadratic is $x^* = -\hat{\beta}_1/(2\hat{\beta}_2)$, which leads to a 
positive value if $\hat{\beta}_1$ and $\hat{\beta}_2$ have opposite signs and a negative value when $\hat{\beta}_1$ and $\hat{\beta}_2$ have the same 
sign. Knowing this simple formula is useful in cases where x may take on both positive and negative 
values; one can compute the turning point and see if it makes sense, taking into account the range of 
x in the sample. 
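
A small Python sketch of the general formula, including the check against the sample range that the paragraph recommends. Both function names are hypothetical, and the same-sign coefficients and the range (0 to 10) are made-up values used only to show the negative case.

```python
def turning_point(b1, b2):
    """General turning point x* = -b1 / (2*b2) of b0 + b1*x + b2*x**2."""
    if b2 == 0:
        raise ValueError("a linear function has no turning point")
    return -b1 / (2.0 * b2)

def turning_point_in_sample(b1, b2, x_min, x_max):
    """True if the turning point falls inside the observed range of x."""
    return x_min <= turning_point(b1, b2) <= x_max

# Opposite signs (housing example estimates): positive turning point.
housing_tp = turning_point(-0.545, 0.062)
print(round(housing_tp, 1))  # 4.4

# Same signs (hypothetical coefficients): negative turning point,
# so it is irrelevant when x only takes nonnegative values.
same_sign_tp = turning_point(0.5, 0.1)
print(round(same_sign_tp, 1))                       # -2.5
print(turning_point_in_sample(0.5, 0.1, 0.0, 10.0)) # False
```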

There are many other possibilities for using quadratics along with logarithms. For example, an 
extension of (6.14) that allows a nonconstant elasticity between price and nox is 


$\log(price) = \beta_0 + \beta_1 \log(nox) + \beta_2 [\log(nox)]^2 + \beta_3 crime + \beta_4 rooms + \beta_5 rooms^2 + \beta_6 stratio + u.$ [6.15] 


If $\beta_2 = 0$, then $\beta_1$ is the elasticity of price with respect to nox. Otherwise, this elasticity depends on 
the level of nox. To see this, we can combine the arguments for the partial effects in the quadratic and 
logarithmic models to show that 


$\%\Delta price \approx [\beta_1 + 2\beta_2 \log(nox)]\,\%\Delta nox;$ [6.16] 


therefore, the elasticity of price with respect to nox is $\beta_1 + 2\beta_2 \log(nox)$, so that it depends on 
log(nox). 
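
The step from the model to the elasticity can be written out explicitly. The following is a sketch of the calculus argument, differentiating with respect to log(nox) while holding the other regressors and u fixed:

```latex
\begin{aligned}
\frac{\partial \log(price)}{\partial \log(nox)}
  &= \beta_1 + 2\beta_2 \log(nox), \\
\%\Delta price
  &\approx \left[\beta_1 + 2\beta_2 \log(nox)\right] \%\Delta nox .
\end{aligned}
```

Setting $\beta_2 = 0$ in the first line recovers the constant elasticity $\beta_1$.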

Finally, other polynomial terms can be included in regression models. Certainly, the quadratic is 
seen most often, but a cubic and even a quartic term appear now and then. An often reasonable func- 
tional form for a total cost function is 


$cost = \beta_0 + \beta_1 quantity + \beta_2 quantity^2 + \beta_3 quantity^3 + u.$ 


Estimating such a model causes no complications. Interpreting the parameters is more involved 
(though straightforward using calculus); we do not study these models further. 


6-2c Models with Interaction Terms 
Sometimes, it is natural for the partial effect, elasticity, or semi-elasticity of the dependent variable 
with respect to an explanatory variable to depend on the magnitude of yet another explanatory vari- 
able. For example, in the model 

$price = \beta_0 + \beta_1 sqrft + \beta_2 bdrms + \beta_3 sqrft \cdot bdrms + \beta_4 bthrms + u,$ 


the partial effect of bdrms on price (holding all other variables fixed) is 


$\dfrac{\Delta price}{\Delta bdrms} = \beta_2 + \beta_3 sqrft.$ [6.17] 


If $\beta_3 > 0$, then (6.17) implies that an additional bedroom yields a higher increase in housing price for 
larger houses. In other words, there is an interaction effect between square footage and number of 
bedrooms. In summarizing the effect of bdrms on price, we must evaluate (6.17) at interesting values 
of sqrft, such as the mean value, or the lower and upper quartiles in the sample. Whether or not $\beta_3$ is 
zero is something we can easily test. 

The parameters on the original variables can be tricky to interpret when we include an interaction 
term. For example, in the previous housing price equation, equation (6.17) shows that $\beta_2$ is the 
effect of bdrms on price for a home with zero square feet! This effect is clearly not of much interest. 
Instead, we must be careful to put interesting values of sqrft, such as the mean or median values in the 
sample, into the estimated version of equation (6.17). 

Often, it is useful to reparameterize a model so that the coefficients on the original variables have 
an interesting meaning. Consider a model with two explanatory variables and an interaction: 


$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + u.$ 


As just mentioned, $\beta_2$ is the partial effect of $x_2$ on y when $x_1 = 0$. Often, this is not of interest. 
Instead, we can reparameterize the model as 


$y = \alpha_0 + \delta_1 x_1 + \delta_2 x_2 + \beta_3 (x_1 - \mu_1)(x_2 - \mu_2) + u,$ 


where $\mu_1$ is the population mean of $x_1$ and $\mu_2$ is the population mean of $x_2$. We can easily see that 
now the coefficient on $x_2$, $\delta_2$, is the partial effect of $x_2$ on y at the mean value of $x_1$. (By multiplying 
out the interaction in the second equation and comparing the coefficients, we can easily show that 
$\delta_2 = \beta_2 + \beta_3 \mu_1$. The parameter $\delta_1$ has a similar interpretation.) Therefore, if we subtract the means 
of the variables (in practice, these would typically be the sample means) before creating the interaction 
term, the coefficients on the original variables have a useful interpretation. Plus, we immediately 
obtain standard errors for the partial effects at the mean values. Nothing prevents us from replacing 
$\mu_1$ or $\mu_2$ with other values of the explanatory variables that may be of interest. The following example 
illustrates how we can use interaction terms. 


EXAMPLE 6.3 Effects of Attendance on Final Exam Performance 


A model to explain the standardized outcome on a final exam (stndfnl) in terms of percentage of 
classes attended, prior college grade point average, and ACT score is 


$stndfnl = \beta_0 + \beta_1 atndrte + \beta_2 priGPA + \beta_3 ACT + \beta_4 priGPA^2 + \beta_5 ACT^2 + \beta_6 priGPA \cdot atndrte + u.$ [6.18] 


(We use the standardized exam score for the reasons discussed in Section 6-1: it is easier to inter- 
pret a student’s performance relative to the rest of the class.) In addition to quadratics in priGPA 
and ACT, this model includes an interaction between priGPA and the attendance rate. The idea is 
that class attendance might have a different effect for students who have performed differently in 
the past, as measured by priGPA. We are interested in the effects of attendance on final exam score: 
$\Delta stndfnl/\Delta atndrte = \beta_1 + \beta_6 priGPA.$ 

Using the 680 observations in ATTEND, for students in a course on microeconomic principles, 
the estimated equation is 


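$\widehat{stndfnl} = \underset{(1.36)}{2.05} - \underset{(.0102)}{.0067}\,atndrte - \underset{(.48)}{1.63}\,priGPA - \underset{(.098)}{.128}\,ACT + \underset{(.101)}{.296}\,priGPA^2 + \underset{(.0022)}{.0045}\,ACT^2 + \underset{(.0043)}{.0056}\,priGPA \cdot atndrte$ [6.19] 


$n = 680, R^2 = .229, \bar{R}^2 = .222.$ 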

We must interpret this equation with extreme care. If we simply look at the coefficient on atndrte, we 
will incorrectly conclude that attendance has a negative effect on final exam score. But this coefficient 
supposedly measures the effect when priGPA = 0, which is not interesting (in this sample, the 
smallest prior GPA is about .86). We must also take care not to look separately at the estimates of $\beta_1$ and 
$\beta_6$ and conclude that, because each t statistic is insignificant, we cannot reject $H_0: \beta_1 = 0, \beta_6 = 0$. 
In fact, the p-value for the F test of this joint hypothesis is .014, so we certainly reject $H_0$ at the 5% 
level. This is a good example of where looking at separate t statistics when testing a joint hypothesis 
can lead one far astray. 

How should we estimate the partial effect of atndrte on stndfnl? We must plug in interesting 
values of priGPA to obtain the partial effect. The mean value of priGPA in the sample is 2.59, so at 
the mean priGPA, the effect of atndrte on stndfnl is −.0067 + .0056(2.59) ≈ .0078. What does this 
mean? Because atndrte is measured as a percentage, it means that a 10 percentage point increase in 
atndrte increases stndfnl by .078 standard deviations from the mean final exam score. 

How can we tell whether the estimate .0078 is statistically different from zero? We need to rerun 
the regression, where we replace priGPA·atndrte with (priGPA − 2.59)·atndrte. This gives, as the 
new coefficient on atndrte, the estimated effect at priGPA = 2.59, along with its standard error; 
nothing else in the regression changes. (We described this device in Section 4-4.) Running this new 
regression gives the standard error of $\hat{\beta}_1 + \hat{\beta}_6(2.59) = .0078$ as .0026, which yields t = .0078/.0026 = 3. 
Therefore, at the average priGPA, we conclude that attendance has a statistically significant positive 
effect on final exam score. 

GOING FURTHER 6.3 
If we add the term $\beta_7 ACT \cdot atndrte$ to equation (6.18), what is the partial effect of 
atndrte on stndfnl? 
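
The calculation of the effect at the mean priGPA and its t statistic can be sketched in a few lines of Python. The coefficients, the mean of priGPA, and the standard error are the values quoted in the text; the standard error is taken as given, since it comes from the recentered regression and cannot be recovered from the reported coefficients alone. The function name `attendance_effect` is hypothetical.

```python
# Estimates from equation (6.19) and the sample mean of priGPA quoted in the text.
B_ATNDRTE = -0.0067       # coefficient on atndrte
B_INTERACTION = 0.0056    # coefficient on priGPA*atndrte
MEAN_PRIGPA = 2.59
SE_AT_MEAN = 0.0026       # standard error reported from the recentered regression

def attendance_effect(prigpa):
    """Estimated partial effect of atndrte on stndfnl at a given priGPA."""
    return B_ATNDRTE + B_INTERACTION * prigpa

effect_at_mean = attendance_effect(MEAN_PRIGPA)
t_stat = effect_at_mean / SE_AT_MEAN

print(round(effect_at_mean, 4))       # 0.0078
print(round(10 * effect_at_mean, 3))  # 0.078 (effect of a 10 point increase)
print(round(t_stat, 1))               # 3.0
```

Calling `attendance_effect(0.0)` returns the raw coefficient −.0067, which is the uninteresting "effect at priGPA = 0" that the text warns against reporting.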

Things are even more complicated for finding the effect of priGPA on stndfnl because of 
the quadratic term $priGPA^2$. To find the effect at the mean value of priGPA and the mean attendance 
rate, 82, we would replace $priGPA^2$ with $(priGPA - 2.59)^2$ and priGPA·atndrte with 
priGPA·(atndrte − 82). The coefficient on priGPA becomes the partial effect at the mean values, and 
we would have its standard error. (See Computer Exercise C7.) 


6-2d Computing Average Partial Effects 


The hallmark of models with quadratics, interactions, and other nonlinear functional forms is that the 
partial effects depend on the values of one or more explanatory variables. For example, we just saw in 
Example 6.3 that the effect of atndrte depends on the value of priGPA. It is easy to see that the partial 
effect of priGPA in equation (6.18) is 


$\beta_2 + 2\beta_4 priGPA + \beta_6 atndrte$ 


(something that can be verified with simple calculus or just by combining the quadratic and interac- 
tion formulas). The embellishments in equation (6.18) can be useful for seeing how the strength of 
associations between stndfnl and each explanatory variable changes with the values of all explanatory 
variables. The flexibility afforded by a model such as (6.18) does have a cost: it is tricky to describe 
the partial effects of the explanatory variables on stndfnl with a single number. 

Often, one wants a single value to describe the relationship between the dependent variable y and 
each explanatory variable. One popular summary measure is the average partial effect (APE), also 
called the average marginal effect. The idea behind the APE is simple for models such as (6.18). After 
computing the partial effect and plugging in the estimated parameters, we average the partial effects 
for each unit across the sample. So, the estimated partial effect of atndrte on stndfnl is 


$\hat{\beta}_1 + \hat{\beta}_6\, priGPA_i.$ 

We do not want to report this partial effect for each of the 680 students in our sample. Instead, we 
average these partial effects to obtain 


$APE_{stndfnl} = \hat{\beta}_1 + \hat{\beta}_6\, \overline{priGPA},$ 


where $\overline{priGPA}$ is the sample average of priGPA. The single number $APE_{stndfnl}$ is the (estimated) APE. 
The APE of priGPA is only a little more complicated: 


$APE_{priGPA} = \hat{\beta}_2 + 2\hat{\beta}_4\, \overline{priGPA} + \hat{\beta}_6\, \overline{atndrte}.$ 


Both $APE_{stndfnl}$ and $APE_{priGPA}$ tell us the size of the partial effects on average. 
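
Because the partial effect of atndrte is linear in priGPA, averaging the individual partial effects gives the same number as evaluating the formula at the sample average. The sketch below illustrates this with the two coefficient estimates from equation (6.19) and a short list of made-up priGPA values; the list is hypothetical and is not the ATTEND data.

```python
# Coefficient estimates on atndrte and priGPA*atndrte from equation (6.19).
B_ATNDRTE = -0.0067
B_INTERACTION = 0.0056

# Hypothetical priGPA values for a handful of students (not the ATTEND data).
prigpa_sample = [0.86, 2.10, 2.59, 3.00, 3.75]

def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)

# Route 1: compute each student's partial effect, then average.
individual_effects = [B_ATNDRTE + B_INTERACTION * g for g in prigpa_sample]
ape_by_averaging = mean(individual_effects)

# Route 2: plug the sample average of priGPA into the partial-effect formula.
ape_at_mean = B_ATNDRTE + B_INTERACTION * mean(prigpa_sample)

print(abs(ape_by_averaging - ape_at_mean) < 1e-12)  # True
```

The same equivalence does not carry over to models that are nonlinear in parameters, where the average of the partial effects and the partial effect at the average generally differ.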

The centering of explanatory variables about their sample averages before creating quadratics or 
interactions forces the coefficient on the levels to be the APEs. This can be cumbersome in compli- 
cated models. Fortunately, some commonly used regression packages compute APEs with a simple 
command after OLS estimation. Just as importantly, proper standard errors are computed using the 
fact that an APE is a linear combination of the OLS coefficients. For example, the APEs and their 
standard errors for models with both quadratics and interactions, as in Example 6.3, are easy to obtain. 

APEs are also useful in models that are inherently nonlinear in parameters, which we treat in 
Chapter 17. At that point we will revisit the definition and calculation of APEs. 


6-3 More on Goodness-of-Fit and Selection of Regressors 


Until now, we have not focused much on the size of $R^2$ in evaluating our regression models, primarily 
because beginning students tend to put too much weight on R-squared. As we will see shortly, choos- 
ing a set of explanatory variables based on the size of the R-squared can lead to nonsensical models. 
In Chapter 10, we will discover that R-squareds obtained from time series regressions can be artifi- 
cially high and can result in misleading conclusions. 

Nothing about the classical linear model assumptions requires that $R^2$ be above any particular 
value; $R^2$ is simply an estimate of how much variation in y is explained by $x_1, x_2, \ldots, x_k$ in the 
population. We have seen several regressions that have had pretty small R-squareds. Although this means 
that we have not accounted for several factors that affect y, this does not mean that the factors in u 
are correlated with the independent variables. The zero conditional mean assumption MLR.4 is what 
determines whether we get unbiased estimators of the ceteris paribus effects of the independent vari- 
ables, and the size of the R-squared has no direct bearing on this. 

A small R-squared does imply that the error variance is large relative to the variance of y, which 
means we may have a hard time precisely estimating the $\beta_j$. But remember, we saw in Section 3.4 that 
a large error variance can be offset by a large sample size: if we have enough data, we may be able 
to precisely estimate the partial effects even though we have not controlled for many unobserved fac- 
tors. Whether or not we can get precise enough estimates depends on the application. For example, 
suppose that some incoming students at a large university are randomly given grants to buy computer 
equipment. If the amount of the grant is truly randomly determined, we can estimate the ceteris pari- 
bus effect of the grant amount on subsequent college grade point average by using simple regression 
analysis. (Because of random assignment, all of the other factors that affect GPA would be uncor- 
related with the amount of the grant.) It seems likely that the grant amount would explain little of the 
variation in GPA, so the R-squared from such a regression would probably be very small. But, if we 
have a large sample size, we still might get a reasonably precise estimate of the effect of the grant. 

Another good illustration of where poor explanatory power has nothing to do with unbiased 
estimation of the $\beta_j$ is given by analyzing the data set APPLE. Unlike the other data sets we have used, 
the key explanatory variables in APPLE were set experimentally—that is, without regard to other 
factors that might affect the dependent variable. The variable we would like to explain, ecolbs, is the 
(hypothetical) pounds of “ecologically friendly” (“ecolabeled’’) apples a family would demand. Each 
family (actually, family head) was presented with a description of ecolabeled apples, along with prices 
of regular apples (regprc) and prices of the hypothetical ecolabeled apples (ecoprc). Because the 
price pairs were randomly assigned to each family, they are unrelated to other observed factors (such 
as family income) and unobserved factors (such as desire for a clean environment). Therefore, the 
regression of ecolbs on ecoprc, regprc (across all samples generated in this way) produces unbiased 
estimators of the price effects. Nevertheless, the R-squared from the regression is only .0364: the 
price variables explain only about 3.6% of the total variation in ecolbs. So, here is a case where 
we explain very little of the variation in y, yet we are in the rare situation of knowing that the data 
have been generated so that unbiased estimation of the $\beta_j$ is possible. (Incidentally, adding observed 
family characteristics has a very small effect on explanatory power. See Computer Exercise C11.) 

Remember, though, that the relative change in the R-squared when variables are added to an 
equation is very useful: the F statistic in (4.41) for testing the joint significance crucially depends on 
the difference in R-squareds between the unrestricted and restricted models. 

As we will see in Section 6.4, an important consequence of a low R-squared is that prediction is 
difficult. Because most of the variation in y is explained by unobserved factors (or at least factors we 
do not include in our model), we will generally have a hard time using the OLS equation to predict 
individual future outcomes on y given a set of values for the explanatory variables. In fact, the low 
R-squared means that we would have a hard time predicting y even if we knew the $\beta_j$, the population 
coefficients. Fundamentally, most of the factors that explain y are unaccounted for in the explanatory 
variables, making prediction difficult. 


6-3a Adjusted R-Squared 


Most regression packages will report, along with the R-squared, a statistic called the adjusted 
R-squared. Because the adjusted R-squared is reported in much applied work, and because it has 
some useful features, we cover it in this subsection. 

To see how the usual R-squared might be adjusted, it is useful to write it as 


$R^2 = 1 - (SSR/n)/(SST/n),$ [6.20] 


where SSR is the sum of squared residuals and SST is the total sum of squares; compared with 
equation (3.28), all we have done is divide both SSR and SST by n. This expression reveals what $R^2$ is 
actually estimating. Define $\sigma_y^2$ as the population variance of y and let $\sigma_u^2$ denote the population variance of 
the error term, u. (Until now, we have used $\sigma^2$ to denote $\sigma_u^2$, but it is helpful to be more specific here.) 
The population R-squared is defined as $\rho^2 = 1 - \sigma_u^2/\sigma_y^2$; this is the proportion of the variation in y 
in the population explained by the independent variables. This is what $R^2$ is supposed to be estimating. 

$R^2$ estimates $\sigma_u^2$ by SSR/n, which we know to be biased. So why not replace SSR/n with 
SSR/(n − k − 1)? Also, we can use SST/(n − 1) in place of SST/n, as the former is the unbiased estimator of 
$\sigma_y^2$. Using these estimators, we arrive at the adjusted R-squared: 


$\bar{R}^2 = 1 - [SSR/(n - k - 1)]/[SST/(n - 1)] = 1 - \hat{\sigma}^2/[SST/(n - 1)],$ [6.21] 


because $\hat{\sigma}^2 = SSR/(n - k - 1)$. Because of the notation used to denote the adjusted R-squared, it is 
sometimes called R-bar squared. 

The adjusted R-squared is sometimes called the corrected R-squared, but this is not a good name 
because it implies that $\bar{R}^2$ is somehow better than $R^2$ as an estimator of the population R-squared. 
Unfortunately, $\bar{R}^2$ is not generally known to be a better estimator. It is tempting to think that $\bar{R}^2$ 
corrects the bias in $R^2$ for estimating the population R-squared, $\rho^2$, but it does not: the ratio of two 
unbiased estimators is not an unbiased estimator. 

The primary attractiveness of $\bar{R}^2$ is that it imposes a penalty for adding independent 
variables to a model. We know that $R^2$ can never fall when a new independent variable is added to a 
regression equation: this is because SSR never goes up (and usually falls) as more independent 
variables are added (assuming we use the same set of observations). But the formula for $\bar{R}^2$ shows that it 
depends explicitly on k, the number of independent variables. If an independent variable is added to 
a regression, SSR falls, but so does the df in the regression, n − k − 1. SSR/(n − k − 1) can go up or 
down when a new independent variable is added to a regression. 

An interesting algebraic fact is the following: if we add a new independent variable to a 
regression equation, $\bar{R}^2$ increases if, and only if, the t statistic on the new variable is greater than one in 
absolute value. (An extension of this is that $\bar{R}^2$ increases when a group of variables is added to a 
regression if, and only if, the F statistic for joint significance of the new variables is greater than 
unity.) Thus, we see immediately that using $\bar{R}^2$ to decide whether a certain independent variable (or 
set of variables) belongs in a model gives us a different answer than standard t or F testing (because a 
t or F statistic of unity is not statistically significant at traditional significance levels). 
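
The "greater than one" rule can be illustrated numerically. The sketch below uses made-up sums of squared residuals (every number in it is hypothetical) and compares the adjusted R-squared before and after adding a single variable; with one added variable, the F statistic equals the squared t statistic.

```python
def adjusted_r2(ssr, sst, n, k):
    """Adjusted R-squared: 1 - [SSR/(n-k-1)] / [SST/(n-1)]."""
    return 1.0 - (ssr / (n - k - 1)) / (sst / (n - 1))

def f_stat_one_added(ssr_r, ssr_ur, n, k_ur):
    """F statistic for adding one variable (equals the squared t statistic)."""
    return (ssr_r - ssr_ur) / (ssr_ur / (n - k_ur - 1))

# Hypothetical values: n observations, k_r regressors in the smaller model.
n, k_r, sst, ssr_r = 50, 3, 100.0, 60.0
base = adjusted_r2(ssr_r, sst, n, k_r)

results = []
for ssr_ur in (59.5, 58.0):   # SSR after adding one more regressor
    f = f_stat_one_added(ssr_r, ssr_ur, n, k_r + 1)
    rises = adjusted_r2(ssr_ur, sst, n, k_r + 1) > base
    results.append((round(f, 2), rises))

print(results)  # [(0.38, False), (1.55, True)]
```

In the first case SSR falls but F is below one, so the adjusted R-squared drops; in the second, F exceeds one and the adjusted R-squared rises.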

It is sometimes useful to have a formula for $\bar{R}^2$ in terms of $R^2$. Simple algebra gives 


$\bar{R}^2 = 1 - (1 - R^2)(n - 1)/(n - k - 1).$ [6.22] 


For example, if $R^2 = .30$, n = 51, and k = 10, then $\bar{R}^2 = 1 - .70(50)/40 = .125$. Thus, for small 
n and large k, $\bar{R}^2$ can be substantially below $R^2$. In fact, if the usual R-squared is small, and n − k − 1 
is small, $\bar{R}^2$ can actually be negative! For example, you can plug in $R^2 = .10$, n = 51, and k = 10 
to verify that $\bar{R}^2 = -.125$. A negative $\bar{R}^2$ indicates a very poor model fit relative to the number of 
degrees of freedom. 
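
Equation (6.22) is easy to evaluate directly. The following minimal Python sketch reproduces the two numerical examples in the paragraph above; the function name `adjusted_r2_from_r2` is a hypothetical helper.

```python
def adjusted_r2_from_r2(r2, n, k):
    """Equation (6.22): 1 - (1 - R^2)(n - 1)/(n - k - 1)."""
    if n - k - 1 <= 0:
        raise ValueError("need n - k - 1 > 0 degrees of freedom")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

print(round(adjusted_r2_from_r2(0.30, 51, 10), 3))  # 0.125
print(round(adjusted_r2_from_r2(0.10, 51, 10), 3))  # -0.125
```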

The adjusted R-squared is sometimes reported along with the usual R-squared in regressions, and 
sometimes $\bar{R}^2$ is reported in place of $R^2$. It is important to remember that it is $R^2$, not $\bar{R}^2$, that appears 
in the F statistic in (4.41). The same formula with $\bar{R}^2_r$ and $\bar{R}^2_{ur}$ is not valid. 


6-3b Using Adjusted R-Squared to Choose between Nonnested Models 


In Section 4-5, we learned how to compute an F statistic for testing the joint significance of a group 
of variables; this allows us to decide, at a particular significance level, whether at least one variable in 
the group affects the dependent variable. This test does not allow us to decide which of the variables 
has an effect. In some cases, we want to choose a model without redundant independent variables, 
and the adjusted R-squared can help with this. 

In the major league baseball salary example in Section 4-5, we saw that neither hrunsyr nor 
rbisyr was individually significant. These two variables are highly correlated, so we might want to 
choose between the models 


$\log(salary) = \beta_0 + \beta_1 years + \beta_2 gamesyr + \beta_3 bavg + \beta_4 hrunsyr + u$ 

and 

$\log(salary) = \beta_0 + \beta_1 years + \beta_2 gamesyr + \beta_3 bavg + \beta_4 rbisyr + u.$ 


These two equations are nonnested models because neither equation is a special case of the other. 
The F statistics we studied in Chapter 4 only allow us to test nested models: one model (the restricted 
model) is a special case of the other model (the unrestricted model). See equations (4.32) and (4.28) 
for examples of restricted and unrestricted models. One possibility is to create a composite model 
that contains all explanatory variables from the original models and then to test each model against 
the general model using the F test. The problem with this process is that either both models might 
be rejected or neither model might be rejected (as happens with the major league baseball salary 
example in Section 4-5). Thus, it does not always provide a way to distinguish between models with 
nonnested regressors. 

In the baseball player salary regression using the data in MLB1, $\bar{R}^2$ for the regression containing 
hrunsyr is .6211, and $\bar{R}^2$ for the regression containing rbisyr is .6226. Thus, based on the adjusted 
R-squared, there is a very slight preference for the model with rbisyr. But the difference is practi- 
cally very small, and we might obtain a different answer by controlling for some of the variables in 
Computer Exercise C5 in Chapter 4. (Because both nonnested models contain five parameters, the 
usual R-squared can be used to draw the same conclusion.) 

Comparing $\bar{R}^2$ to choose among different nonnested sets of independent variables can be 
valuable when these variables represent different functional forms. Consider two models relating R&D 
intensity to firm sales: 


$rdintens = \beta_0 + \beta_1 \log(sales) + u.$ [6.23] 
$rdintens = \beta_0 + \beta_1 sales + \beta_2 sales^2 + u.$ [6.24] 


The first model captures a diminishing return by including sales in logarithmic form; the second 
model does this by using a quadratic. Thus, the second model contains one more parameter than 
the first. 

When equation (6.23) is estimated using the 32 observations on chemical firms in RDCHEM, $R^2$ 
is .061, and $R^2$ for equation (6.24) is .148. Therefore, it appears that the quadratic fits much better. But 
a comparison of the usual R-squareds is unfair to the first model because it contains one fewer param- 
eter than (6.24). That is, (6.23) is a more parsimonious model than (6.24). 

Everything else being equal, simpler models are better. Because the usual R-squared does not 
penalize more complicated models, it is better to use $\bar{R}^2$. The $\bar{R}^2$ for (6.23) is .030, while $\bar{R}^2$ for (6.24) 
is .090. Thus, even after adjusting for the difference in degrees of freedom, the quadratic model wins 
out. The quadratic model is also preferred when profit margin is added to each regression. 

There is an important limitation in using $\bar{R}^2$ to choose between nonnested models: we cannot 
use it to choose between different functional forms for the dependent variable. This is unfortunate, 
because we often want to decide on whether y or log(y) (or maybe some other transformation) should 
be used as the dependent variable based on goodness-of-fit. But neither $R^2$ nor $\bar{R}^2$ can be used for 
this purpose. The reason is simple: these R-squareds measure the explained proportion of the total 
variation in whatever dependent variable we are using in the regression, and different nonlinear 
functions of the dependent variable will have different amounts of variation to explain. For example, the 
total variations in y and log(y) are not the same and are often very different. Comparing the adjusted 
R-squareds from regressions with these different forms of the dependent variables does not tell us 
anything about which model fits better; they are fitting two separate dependent variables. 

GOING FURTHER 6.4 
Explain why choosing a model by maximizing $\bar{R}^2$ or minimizing $\hat{\sigma}$ (the standard error of 
the regression) is the same thing. 


EXAMPLE 6.4 CEO Compensation and Firm Performance 


Consider two estimated models relating CEO compensation to firm performance: 


$\widehat{salary} = \underset{(223.90)}{830.63} + \underset{(.0089)}{.0163}\,sales + \underset{(11.08)}{19.63}\,roe$ [6.25] 
$n = 209, R^2 = .029, \bar{R}^2 = .020$ 

and 

$\widehat{lsalary} = \underset{(0.29)}{4.36} + \underset{(.033)}{.275}\,lsales + \underset{(.0040)}{.0179}\,roe$ [6.26] 
$n = 209, R^2 = .282, \bar{R}^2 = .275,$ 

where roe is the return on equity discussed in Chapter 2. For simplicity, lsalary and lsales denote the 
natural logs of salary and sales. We already know how to interpret these different estimated equa- 
tions. But can we say that one model fits better than the other? 
The R-squared for equation (6.25) shows that sales and roe explain only about 2.9% of the varia- 
tion in CEO salary in the sample. Both sales and roe have marginal statistical significance. 
Equation (6.26) shows that log(sales) and roe explain about 28.2% of the variation in log(salary). 
In terms of goodness-of-fit, this much higher R-squared would seem to imply that model (6.26) is 
much better, but this is not necessarily the case. The total sum of squares for salary in the sample 
is 391,732,982, while the total sum of squares for log(salary) is only 66.72. Thus, there is much less 
variation in log(salary) that needs to be explained. 
At this point, we can use features other than $R^2$ or $\bar{R}^2$ to decide between these models. For 
example, log(sales) and roe are much more statistically significant in (6.26) than are sales and roe in (6.25), 
and the coefficients in (6.26) are probably of more interest. To be sure, however, we will need to make 
a valid goodness-of-fit comparison. 


In Section 6-4, we will offer a goodness-of-fit measure that does allow us to compare models 
where y appears in both level and log form. 


6-3c Controlling for Too Many Factors in Regression Analysis 


In many of the examples we have covered, and certainly in our discussion of omitted variables bias in 
Chapter 3, we have worried about omitting important factors from a model that might be correlated with 
the independent variables. It is also possible to control for too many variables in a regression analysis. 

If we overemphasize goodness-of-fit, we open ourselves to controlling for factors in a regression 
model that should not be controlled for. To avoid this mistake, we need to remember the ceteris pari- 
bus interpretation of multiple regression models. 

To illustrate this issue, suppose we are doing a study to assess the impact of state beer taxes on 
traffic fatalities. The idea is that a higher tax on beer will reduce alcohol consumption, and likewise 
drunk driving, resulting in fewer traffic fatalities. To measure the ceteris paribus effect of taxes on 
fatalities, we can model fatalities as a function of several factors, including the beer tax: 


$fatalities = \beta_0 + \beta_1 tax + \beta_2 miles + \beta_3 percmale + \beta_4 perc16\_21 + \cdots,$


where 


miles = total miles driven. 
percmale = percentage of the state population that is male. 
perc16_21 = percentage of the population between ages 16 and 21, and so on.


Notice how we have not included a variable measuring per capita beer consumption. Are we 
committing an omitted variables error? The answer is no. If we control for beer consumption in this 
equation, then how would beer taxes affect traffic fatalities? In the equation 


$fatalities = \beta_0 + \beta_1 tax + \beta_2 beercons + \cdots,$

$\beta_1$ measures the difference in fatalities due to a one percentage point increase in tax, holding beercons
fixed. It is difficult to understand why this would be interesting. We should not be controlling for dif- 
ferences in beercons across states, unless we want to test for some sort of indirect effect of beer taxes. 
Other factors, such as gender and age distribution, should be controlled for. 
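
The beer tax argument can be illustrated with a small simulation. The sketch below uses made-up numbers (not data from any actual study) in which the tax affects fatalities only by lowering beer consumption. The variable names mirror the text, but the data-generating process and the helper `ols` are assumptions for illustration.

```python
import numpy as np

def ols(y, *regressors):
    """Return OLS coefficients (intercept first) from regressing y on the regressors."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

rng = np.random.default_rng(0)
n = 5_000

# Hypothetical data-generating process (all numbers are invented):
tax = rng.uniform(0.0, 2.0, n)                                # beer tax
beercons = 10.0 - 2.0 * tax + rng.normal(0.0, 1.0, n)         # a higher tax lowers consumption
fatalities = 5.0 + 1.5 * beercons + rng.normal(0.0, 1.0, n)   # tax matters only through beercons

b_total = ols(fatalities, tax)[1]               # beercons left out of the regression
b_partial = ols(fatalities, tax, beercons)[1]   # beercons held fixed

print(f"tax coefficient without beercons: {b_total:.2f}")
print(f"tax coefficient with beercons:    {b_partial:.2f}")
```

By construction the total effect of the tax is 1.5 times -2.0, or -3.0. The regression that leaves out beercons estimates that total effect, while the regression that holds beercons fixed should give a tax coefficient near zero, because in this setup the tax has no effect once consumption is fixed.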

As a second example, suppose that, for a developing country, we want to estimate the effects of 
pesticide usage among farmers on family health expenditures. In addition to pesticide usage amounts, 
should we include the number of doctor visits as an explanatory variable? No. Health expenditures 
include doctor visits, and we would like to pick up all effects of pesticide use on health expenditures. 
If we include the number of doctor visits as an explanatory variable, then we are only measuring the 
effects of pesticide use on health expenditures other than doctor visits. It makes more sense to use 
number of doctor visits as a dependent variable in a separate regression on pesticide amounts. 

The previous examples are what can be called over controlling for factors in multiple regression. 
Often this results from nervousness about potential biases that might arise by leaving out an important 
explanatory variable. But it is important to remember the ceteris paribus nature of multiple regression. 
In some cases, it makes no sense to hold some factors fixed precisely because they should be allowed 
to change when a policy variable changes. 

Unfortunately, the issue of whether or not to control for certain factors is not always clear-cut. 
For example, Betts (1995) studies the effect of high school quality on subsequent earnings. He points 
out that, if better school quality results in more education, then controlling for education in the regres- 
sion along with measures of quality will underestimate the return to quality. Betts does the analysis 
with and without years of education in the equation to get a range of estimated effects for quality of 
schooling. 

To see explicitly how pursuing high R-squareds can lead to trouble, consider the housing price 
example from Section 4-5 that illustrates the testing of multiple hypotheses. In that case, we wanted to 
test the rationality of housing price assessments. We regressed log(price) on log(assess), log(lotsize), 
log(sqrft), and bdrms and tested whether the latter three variables had zero population coefficients 
while log(assess) had a coefficient of unity. But what if we change the purpose of the analysis and 
estimate a hedonic price model, which allows us to obtain the marginal values of various housing 
attributes? Should we include log(assess) in the equation? The adjusted R-squared from the regres- 
sion with log(assess) is .762, while the adjusted R-squared without it is .630. Based on goodness- 
of-fit only, we should include log(assess). But this is incorrect if our goal is to determine the effects 
of lot size, square footage, and number of bedrooms on housing values. Including log(assess) in the 
equation amounts to holding one measure of value fixed and then asking how much an additional 
bedroom would change another measure of value. This makes no sense for valuing housing attributes. 

If we remember that different models serve different purposes, and we focus on the ceteris pari- 
bus interpretation of regression, then we will not include the wrong factors in a regression model. 


6-3d Adding Regressors to Reduce the Error Variance 


We have just seen some examples of where certain independent variables should not be included in 
a regression model, even though they are correlated with the dependent variable. From Chapter 3, 
we know that adding a new independent variable to a regression can exacerbate the multicollinearity 
problem. On the other hand, because we are taking something out of the error term, adding a variable 
generally reduces the error variance. Generally, we cannot know which effect will dominate. 

However, there is one case that is clear: we should always include independent variables that 
affect y and are uncorrelated with all of the independent variables of interest. Why? Because adding 
such a variable does not induce multicollinearity in the population (and therefore multicollinearity in 
the sample should be negligible), but it will reduce the error variance. In large sample sizes, the stan- 
dard errors of all OLS estimators will be reduced. 

As an example, consider estimating the individual demand for beer as a function of the average 
county beer price. It may be reasonable to assume that individual characteristics are uncorrelated with
county-level prices, and so a simple regression of beer consumption on county price would suffice for
estimating the effect of price on individual demand. But it is possible to get a more precise estimate 
of the price elasticity of beer demand by including individual characteristics, such as age and amount 
of education. If these factors affect demand and are uncorrelated with price, then the standard error of 
the price coefficient will be smaller, at least in large samples. 
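
The precision gain can be seen in simulated data. The following is a minimal sketch with invented coefficients: `age` is drawn independently of `price`, so omitting it does not bias the price coefficient, but including it removes variation from the error term. The helper `ols_with_se` is a hypothetical name; it computes the usual OLS standard errors under homoskedasticity.

```python
import numpy as np

def ols_with_se(y, X):
    """OLS coefficients and usual (homoskedasticity-based) standard errors.

    X must already contain a column of ones for the intercept.
    """
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    sigma2_hat = resid @ resid / (n - p)   # unbiased estimator of the error variance
    se = np.sqrt(sigma2_hat * np.diag(XtX_inv))
    return beta, se

rng = np.random.default_rng(1)
n = 2_000
price = rng.normal(0.0, 1.0, n)   # stands in for the county beer price
age = rng.normal(0.0, 1.0, n)     # individual characteristic, independent of price
beer = 1.0 - 0.5 * price + 2.0 * age + rng.normal(0.0, 1.0, n)

ones = np.ones(n)
b_short, se_short = ols_with_se(beer, np.column_stack([ones, price]))
b_long, se_long = ols_with_se(beer, np.column_stack([ones, price, age]))

print(f"price only:     coef = {b_short[1]:.3f}, se = {se_short[1]:.4f}")
print(f"price and age:  coef = {b_long[1]:.3f}, se = {se_long[1]:.4f}")
```

Both regressions should give a price coefficient near the assumed value of -0.5, but the standard error in the longer regression is smaller because `age` has been taken out of the error term.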

As a second example, consider the grants for computer equipment given at the beginning of 
Section 6-3. If, in addition to the grant variable, we control for other factors that can explain college 
GPA, we can probably get a more precise estimate of the effect of the grant. Measures of high school 
grade point average and rank, SAT and ACT scores, and family background variables are good can- 
didates. Because the grant amounts are randomly assigned, all additional control variables are uncor- 
related with the grant amount; in the sample, multicollinearity between the grant amount and other 
independent variables should be minimal. But adding the extra controls might significantly reduce 
the error variance, leading to a more precise estimate of the grant effect. Remember, the issue is not 
unbiasedness here: we obtain an unbiased and consistent estimator whether or not we add the high 
school performance and family background variables. The issue is getting an estimator with a smaller 
sampling variance. 

A related point is that when we have random assignment of a policy, we need not worry about 
whether some of our explanatory variables are “endogenous,” provided these variables themselves
are not affected by the policy. For example, in studying the effect of hours in a job training program 
on labor earnings, we can include the amount of education reported prior to the job training program. 
We need not worry that schooling might be correlated with omitted factors, such as “ability,” because 
we are not trying to estimate the return to schooling. We are trying to estimate the effect of the job 
training program, and we can include any controls that are not themselves affected by job training 
without biasing the job training effect. What we must avoid is including a variable such as the amount 
of education after the job training program, as some people may decide to get more education because 
of how many hours they were assigned to the job training program. 

Unfortunately, cases where we have information on additional explanatory variables that are 
uncorrelated with the explanatory variables of interest are somewhat rare in the social sciences. But 
it is worth remembering that when these variables are available, they can be included in a model to 
reduce the error variance without inducing multicollinearity. 


6-4 Prediction and Residual Analysis 


In Chapter 3, we defined the OLS predicted or fitted values and the OLS residuals. Predictions are 
certainly useful, but they are subject to sampling variation, because they are obtained using the OLS 
estimators. Thus, in this section, we show how to obtain confidence intervals for a prediction from the 
OLS regression line. 

From Chapters 3 and 4, we know that the residuals are used to obtain the sum of squared residu- 
als and the R-squared, so they are important for goodness-of-fit and testing. Sometimes, economists 
study the residuals for particular observations to learn about individuals (or firms, houses, etc.) in the 
sample. 


6-4a Confidence Intervals for Predictions


Suppose we have estimated the equation 
$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \cdots + \hat{\beta}_k x_k.$ [6.27]


When we plug in particular values of the independent variables, we obtain a prediction for y, which 
is an estimate of the expected value of y given the particular values for the explanatory variables. 
For emphasis, let $c_1, c_2, \ldots, c_k$ denote particular values for each of the k independent variables; these
may or may not correspond to an actual data point in our sample. The parameter we would like to
estimate is


$\theta_0 = \beta_0 + \beta_1 c_1 + \beta_2 c_2 + \cdots + \beta_k c_k = E(y \mid x_1 = c_1, x_2 = c_2, \ldots, x_k = c_k).$ [6.28]


The estimator of $\theta_0$ is
$\hat{\theta}_0 = \hat{\beta}_0 + \hat{\beta}_1 c_1 + \hat{\beta}_2 c_2 + \cdots + \hat{\beta}_k c_k.$ [6.29]


In practice, this is easy to compute. But what if we want some measure of the uncertainty in this predicted
value? It is natural to construct a confidence interval for $\theta_0$, which is centered at $\hat{\theta}_0$.

To obtain a confidence interval for $\theta_0$, we need a standard error for $\hat{\theta}_0$. Then, with a large df, we
can construct a 95% confidence interval using the rule of thumb $\hat{\theta}_0 \pm 2 \cdot se(\hat{\theta}_0)$. (As always, we can
use the exact percentiles in a t distribution.)

How do we obtain the standard error of $\hat{\theta}_0$? This is the same problem we encountered in
Section 4-4: we need to obtain a standard error for a linear combination of the OLS estimators. Here,
the problem is even more complicated, because all of the OLS estimators generally appear in $\hat{\theta}_0$
(unless some $c_j$ are zero). Nevertheless, the same trick that we used in Section 4-4 will work here.
Write $\beta_0 = \theta_0 - \beta_1 c_1 - \cdots - \beta_k c_k$ and plug this into the equation


$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + u$
to obtain
$y = \theta_0 + \beta_1 (x_1 - c_1) + \beta_2 (x_2 - c_2) + \cdots + \beta_k (x_k - c_k) + u.$ [6.30]
In other words, we subtract the value $c_j$ from each observation on $x_j$, and then we run the regression of
$y_i$ on $(x_{i1} - c_1), \ldots, (x_{ik} - c_k)$, $i = 1, 2, \ldots, n$. [6.31]


The predicted value in (6.29) and, more importantly, its standard error, are obtained from the intercept 
(or constant) in regression (6.31). 
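
The centering trick in (6.30) and (6.31) can be checked numerically. The sketch below uses simulated data with invented coefficients and a hypothetical helper `ols`. It compares the intercept and its standard error from the centered regression with the prediction and standard error computed directly from the estimated covariance matrix of the original regression.

```python
import numpy as np

def ols(y, X):
    """Return OLS coefficients and their estimated covariance matrix.

    X must already include a column of ones for the intercept.
    """
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    sigma2_hat = resid @ resid / (n - p)   # unbiased estimator of the error variance
    return beta, sigma2_hat * XtX_inv

rng = np.random.default_rng(2)
n = 500
x1 = rng.normal(10.0, 2.0, n)
x2 = rng.normal(5.0, 1.0, n)
y = 1.0 + 0.5 * x1 - 0.3 * x2 + rng.normal(0.0, 1.0, n)
c1, c2 = 12.0, 4.0                          # values at which we want the prediction
ones = np.ones(n)

# Original regression: the prediction is a linear combination of all coefficients.
beta, cov = ols(y, np.column_stack([ones, x1, x2]))
c = np.array([1.0, c1, c2])
theta_direct = c @ beta
se_direct = np.sqrt(c @ cov @ c)

# Centered regression: the intercept is the prediction, with its standard error.
beta_c, cov_c = ols(y, np.column_stack([ones, x1 - c1, x2 - c2]))
theta_hat = beta_c[0]
se_theta = np.sqrt(cov_c[0, 0])

print(f"direct:   {theta_direct:.4f} (se {se_direct:.4f})")
print(f"centered: {theta_hat:.4f} (se {se_theta:.4f})")
```

The two approaches agree up to floating-point error, and the slope estimates are unchanged by the centering. The centered regression is convenient because any regression package reports the intercept's standard error without further matrix algebra.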

As an example, we obtain a confidence interval for a prediction from a college GPA regression, 
where we use high school information. 


EXAMPLE 6.5 Confidence Interval for Predicted College GPA


Using the data in GPA2, we obtain the following equation for predicting college GPA: 


$\widehat{colgpa} = 1.493 + .00149\, sat - .01386\, hsperc - .06088\, hsize + .00546\, hsize^2$ [6.32]
          (0.075)   (.00007)       (.00056)          (.01650)         (.00227)
n = 4,137, $R^2 = .278$, $\bar{R}^2 = .277$, $\hat{\sigma} = .560$,


where we have reported estimates to several digits to reduce round-off error. What is predicted col- 
lege GPA, when sat = 1,200, hsperc = 30, and hsize = 5 (which means 500)? This is easy to get by 
plugging these values into equation (6.32): $\widehat{colgpa} = 2.70$ (rounded to two digits). Unfortunately, we
cannot use equation (6.32) directly to get a confidence interval for the expected colgpa at the given
values of the independent variables. One simple way to obtain a confidence interval is to define a new
set of independent variables: sat0 = sat - 1,200, hsperc0 = hsperc - 30, hsize0 = hsize - 5, and
hsizesq0 = $hsize^2$ - 25. When we regress colgpa on these new independent variables, we get


$\widehat{colgpa} = 2.700 + .00149\, sat0 - .01386\, hsperc0 - .06088\, hsize0 + .00546\, hsizesq0$
          (0.020)   (.00007)        (.00056)           (.01650)          (.00227)
n = 4,137, $R^2 = .278$, $\bar{R}^2 = .277$, $\hat{\sigma} = .560$.


The only difference between this regression and that in (6.32) is the intercept, which is the predic- 
tion we want, along with its standard error, .020. It is not an accident that the slope coefficients, their 
standard errors, R-squared, and so on are the same as before; this provides a way to check that the 
proper transformations were done. We can easily construct a 95% confidence interval for the expected 
college GPA: $2.70 \pm 1.96(.020)$ or about 2.66 to 2.74. This confidence interval is rather narrow due to
the very large sample size. 


Because the variance of the intercept estimator is smallest when each explanatory variable has 
zero sample mean (see Problem 10, part (iv) in Chapter 2 for the simple regression case), it follows 
from the regression in (6.31) that the variance of the prediction is smallest at the mean values of the $x_j$.
(That is, $c_j = \bar{x}_j$ for all j.) This result is not too surprising, as we have the most faith in our regression
line near the middle of the data. As the values of the $c_j$ get farther away from the $\bar{x}_j$, $Var(\hat{\theta}_0)$ gets larger
and larger.

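
For the simple regression case, the claim can be read directly from the standard variance formula. The expression below is a sketch under the Gauss-Markov assumptions, conditional on the sample values of x, with a single regressor x and a single value c. The symbol $SST_x$ for the total sum of squares in x is notation assumed here.

```latex
\operatorname{Var}(\hat{\theta}_0)
  = \operatorname{Var}(\hat{\beta}_0 + \hat{\beta}_1 c)
  = \sigma^2 \left[ \frac{1}{n} + \frac{(c - \bar{x})^2}{SST_x} \right],
\qquad SST_x = \sum_{i=1}^{n} (x_i - \bar{x})^2 .
```

The second term in brackets is zero when $c = \bar{x}$ and grows with the squared distance between c and $\bar{x}$, which is why the prediction is most precise at the sample mean.
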
The previous method allows us to put a confidence interval around the OLS estimate of
$E(y \mid x_1, \ldots, x_k)$ for any values of the explanatory variables. In other words, we obtain a confidence
interval for the average value of y for the subpopulation with a given set of covariates. But a confidence
interval for the average person in the subpopulation is not the same as a confidence interval for
a particular unit (individual, family, firm, and so on) from the population. In forming a confidence
interval for an unknown outcome on y, we must account for another very important source of variation:
the variance in the unobserved error, which measures our ignorance of the unobserved factors
that affect y.

Let $y^0$ denote the value for which we would like to construct a confidence interval, which we
sometimes call a prediction interval. For example, $y^0$ could represent a person or firm not in our
original sample. Let $x_1^0, \ldots, x_k^0$ be the new values of the independent variables, which we assume we
observe, and let $u^0$ be the unobserved error. Therefore, we have


$y^0 = \beta_0 + \beta_1 x_1^0 + \beta_2 x_2^0 + \cdots + \beta_k x_k^0 + u^0.$ [6.33]


As before, our best prediction of $y^0$ is the expected value of $y^0$ given the explanatory variables, which
we estimate from the OLS regression line: $\hat{y}^0 = \hat{\beta}_0 + \hat{\beta}_1 x_1^0 + \hat{\beta}_2 x_2^0 + \cdots + \hat{\beta}_k x_k^0$. The prediction error
in using $\hat{y}^0$ to predict $y^0$ is


$\hat{e}^0 = y^0 - \hat{y}^0 = (\beta_0 + \beta_1 x_1^0 + \cdots + \beta_k x_k^0) + u^0 - \hat{y}^0.$ [6.34]


Now, $E(\hat{y}^0) = E(\hat{\beta}_0) + E(\hat{\beta}_1) x_1^0 + E(\hat{\beta}_2) x_2^0 + \cdots + E(\hat{\beta}_k) x_k^0 = \beta_0 + \beta_1 x_1^0 + \cdots + \beta_k x_k^0$, because
the $\hat{\beta}_j$ are unbiased. (As before, these expectations are all conditional on the sample values of the
independent variables.) Because $u^0$ has zero mean, $E(\hat{e}^0) = 0$. We have shown that the expected
prediction error is zero.

In finding the variance of $\hat{e}^0$, note that $u^0$ is uncorrelated with each $\hat{\beta}_j$, because $u^0$ is uncorrelated with
the errors in the sample used to obtain the $\hat{\beta}_j$. By basic properties of covariance (see Math Refresher B),
$u^0$ and $\hat{y}^0$ are uncorrelated. Therefore, the variance of the prediction error (conditional on all
in-sample values of the independent variables) is the sum of the variances:


$Var(\hat{e}^0) = Var(\hat{y}^0) + Var(u^0) = Var(\hat{y}^0) + \sigma^2,$ [6.35]


where $\sigma^2 = Var(u^0)$ is the error variance. There are two sources of variation in $\hat{e}^0$. The first is the
sampling error in $\hat{y}^0$, which arises because we have estimated the $\beta_j$. Because each $\hat{\beta}_j$ has a variance
proportional to 1/n, where n is the sample size, $Var(\hat{y}^0)$ is proportional to 1/n. This means
that, for large samples, $Var(\hat{y}^0)$ can be very small. By contrast, $\sigma^2$ is the variance of the error in
the population; it does not change with the sample size. In many examples, $\sigma^2$ will be the dominant
term in (6.35).

Under the classical linear model assumptions, the $\hat{\beta}_j$ and $u^0$ are normally distributed, and so $\hat{e}^0$ is
also normally distributed (conditional on all sample values of the explanatory variables). Earlier, we
described how to obtain an unbiased estimator of $Var(\hat{y}^0)$, and we obtained our unbiased estimator of
$\sigma^2$ in Chapter 3. By using these estimators, we can define the standard error of $\hat{e}^0$ as


$se(\hat{e}^0) = \{[se(\hat{y}^0)]^2 + \hat{\sigma}^2\}^{1/2}.$ [6.36]


Using the same reasoning for the t statistics of the $\hat{\beta}_j$, $\hat{e}^0/se(\hat{e}^0)$ has a t distribution with $n - (k + 1)$
degrees of freedom. Therefore,


$P[-t_{.025} \le \hat{e}^0/se(\hat{e}^0) \le t_{.025}] = .95,$


where $t_{.025}$ is the 97.5th percentile in the $t_{n-k-1}$ distribution. For large $n - k - 1$, remember that
$t_{.025} \approx 1.96$. Plugging in $\hat{e}^0 = y^0 - \hat{y}^0$ and rearranging gives a 95% prediction interval for $y^0$:


$\hat{y}^0 \pm t_{.025} \cdot se(\hat{e}^0);$ [6.37]


as usual, except for small df, a good rule of thumb is $\hat{y}^0 \pm 2 se(\hat{e}^0)$. This is wider than the confidence
interval for $\hat{y}^0$ itself because of $\hat{\sigma}^2$ in (6.36); it often is much wider to reflect the factors in $u^0$ that we
have not accounted for.
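
As a numerical sketch of (6.36) and (6.37), the snippet below plugs in the prediction 2.70, its standard error .020, and the standard error of the regression .560 reported for the college GPA regression above. It uses 1.96 as the critical value, which assumes a large df. The function names are illustrative only.

```python
import math

def mean_ci(y_hat, se_y_hat, crit=1.96):
    """Confidence interval for the expected value of y at the given covariates."""
    half = crit * se_y_hat
    return y_hat - half, y_hat + half

def prediction_interval(y_hat, se_y_hat, sigma_hat, crit=1.96):
    """Prediction interval for a new outcome, using se(e0) from equation (6.36)."""
    se_e = math.sqrt(se_y_hat**2 + sigma_hat**2)
    half = crit * se_e
    return y_hat - half, y_hat + half, se_e

lo_ci, hi_ci = mean_ci(2.70, 0.020)
lo_pi, hi_pi, se_e = prediction_interval(2.70, 0.020, 0.560)

print(f"CI for expected colgpa:       {lo_ci:.2f} to {hi_ci:.2f}")
print(f"se of the prediction error:   {se_e:.3f}")
print(f"prediction interval (colgpa): {lo_pi:.2f} to {hi_pi:.2f}")
```

The interval for an individual outcome is far wider than the interval for the expected value because the error variance term dominates the sampling variance of the prediction.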


EXAMPLE 6.6 Confidence Interval for Future College GPA 


Suppose we want a 95% confidence interval for the future college GPA of a high school student with 
sat = 1,200, hsperc = 30, and hsize = 5. In Example 6.5, we obtained a 95% CI for the average 
college GPA among all students with the particular characteristics sat = 1,200, hsperc = 30, and 
hsize = 5. Now, we want a 95% CI for any particular student with these characteristics. The 95% 
prediction interval must account for the variation in the individual, unobserved characteristics that 
affect college performance. We have everything we need to obtain a CI for colgpa. $se(\hat{y}^0) = .020$ and
$\hat{\sigma} = .560$ and so, from (6.36), $se(\hat{e}^0) = [(.020)^2 + (.560)^2]^{1/2} \approx .560$. Notice how small $se(\hat{y}^0)$ is
relative to $\hat{\sigma}$: virtually all of the variation in $\hat{e}^0$ comes from the variation in $u^0$. The 95% CI is
$2.70 \pm 1.96(.560)$ or about 1.60 to 3.80. This is a wide confidence interval and shows that, based on the factors
we included in the regression, we cannot accurately pin down an individual’s future college grade
point average. (In one sense, this is good news, as it means that high school rank and performance 
on the SAT do not preordain one’s performance in college.) Evidently, the unobserved characteristics 
that affect college GPA vary widely among individuals with the same observed SAT score and high 
school rank.


6-4b Residual Analysis


Sometimes, it is useful to examine individual observations to see whether the actual value of the 
dependent variable is above or below the predicted value; that is, to examine the residuals for the 
individual observations. This process is called residual analysis. Economists have been known to 
examine the residuals from a regression in order to aid in the purchase of a home. The following 
housing price example illustrates residual analysis. Housing price is related to various observable 
characteristics of the house. We can list all of the characteristics that we find important, such as size, 
number of bedrooms, number of bathrooms, and so on. We can use a sample of houses to estimate a 
relationship between price and attributes, where we end up with a predicted value and an actual value 
for each house. Then, we can construct the residuals, $\hat{u}_i = y_i - \hat{y}_i$. The house with the most negative
residual is, at least based on the factors we have controlled for, the most underpriced one relative to its 
observed characteristics. Of course, a selling price substantially below its predicted price could indi- 
cate some undesirable feature of the house that we have failed to account for, and which is therefore 
contained in the unobserved error. In addition to obtaining the prediction and residual, it also makes 
sense to compute a confidence interval for what the future selling price of the home could be, using 
the method described in equation (6.37). 

Using the data in HPRICE1, we run a regression of price on lotsize, sqrft, and bdrms. In the 
sample of 88 homes, the most negative residual is -120.206, for the 81st house. Therefore, the asking
price for this house is $120,206 below its predicted price. 
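
The mechanics of this kind of residual analysis can be sketched as follows. The data here are simulated, not HPRICE1, and the coefficients are invented; only the steps (fit, compute residuals, locate the most negative one) reflect the procedure described above.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 88

# Simulated stand-ins for housing attributes (price in thousands of dollars).
sqrft = rng.uniform(1_000, 4_000, n)
bdrms = rng.integers(2, 6, n)   # 2 to 5 bedrooms
price = 20.0 + 0.12 * sqrft + 15.0 * bdrms + rng.normal(0.0, 40.0, n)

# OLS regression of price on the attributes, with an intercept.
X = np.column_stack([np.ones(n), sqrft, bdrms])
beta, *_ = np.linalg.lstsq(X, price, rcond=None)
fitted = X @ beta
resid = price - fitted

# The most negative residual marks the house priced furthest below its prediction.
i_min = int(np.argmin(resid))
print(f"house index {i_min}: actual {price[i_min]:.1f}, "
      f"predicted {fitted[i_min]:.1f}, residual {resid[i_min]:.1f}")
```

As the text cautions, a large negative residual identifies a candidate bargain only relative to the included attributes; an omitted undesirable feature would produce the same pattern.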

There are many other uses of residual analysis. One way to rank law schools is to regress median 
starting salary on a variety of student characteristics (such as median LSAT scores of entering class, 
median college GPA of entering class, and so on) and to obtain a predicted value and residual for 
each law school. The law school with the largest residual has the highest predicted value added. (Of 
course, there is still much uncertainty about how an individual’s starting salary would compare with 
the median for a law school overall.) These residuals can be used along with the costs of attending 
each law school to determine the best value; this would require an appropriate discounting of future 
earnings. 

Residual analysis also plays a role in legal decisions. A New York Times article entitled “Judge 
Says Pupil’s Poverty, Not Segregation, Hurts Scores” (6/28/95) describes an important legal case. 
The issue was whether the poor performance on standardized tests in the Hartford School District, 
relative to performance in surrounding suburbs, was due to poor school quality at the highly segre- 
gated schools. The judge concluded that “the disparity in test scores does not indicate that Hartford 
is doing an inadequate or poor job in educating its students or that its schools are failing, because the 
predicted scores based upon the relevant socioeconomic factors are about at the levels that one would 
expect.” This conclusion is based on a regression analysis of average or median scores on socioeconomic
characteristics of various school districts in Connecticut. The judge’s conclusion suggests that,
given the poverty levels of students at Hartford schools, the actual test scores were similar to those
predicted from a regression analysis: the residual for Hartford was not sufficiently negative to conclude
that the schools themselves were the cause of low test scores.

GOING FURTHER 6.5
How would you use residual analysis to determine which professional athletes are overpaid or
underpaid relative to their performance?


6-4c Predicting y When log(y) Is the Dependent Variable 


Because the natural log transformation is used so often for the dependent variable in empirical eco- 
nomics, we devote this subsection to the issue of predicting y when log(y) is the dependent variable. 
As a byproduct, we will obtain a goodness-of-fit measure for the log model that can be compared with 
the R-squared from the level model.

To obtain a prediction, it is useful to define logy = log(y); this emphasizes that it is the log of y
that is predicted in the model


$logy = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + u.$ [6.38]


In this equation, the $x_j$ might be transformations of other variables; for example, we could have
$x_1 = \log(sales)$, $x_2 = \log(mktval)$, $x_3 = ceoten$ in the CEO salary example.
Given the OLS estimators, we know how to predict logy for any value of the independent variables:


$\widehat{logy} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 + \cdots + \hat{\beta}_k x_k.$ [6.39]


Now, because the exponential undoes the log, our first guess for predicting y is to simply exponentiate
the predicted value for log(y): $\hat{y} = \exp(\widehat{logy})$. This does not work; in fact, it will systematically
underestimate the expected value of y. If model (6.38) follows the CLM assumptions MLR.1
through MLR.6, it can be shown that


$E(y \mid \mathbf{x}) = \exp(\sigma^2/2) \cdot \exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k),$


where $\mathbf{x}$ denotes the independent variables and $\sigma^2$ is the variance of u. [If $u \sim Normal(0, \sigma^2)$, then the
expected value of $\exp(u)$ is $\exp(\sigma^2/2)$.] This equation shows that a simple adjustment is needed to
predict y:


$\hat{y} = \exp(\hat{\sigma}^2/2) \exp(\widehat{logy}),$ [6.40]


where $\hat{\sigma}^2$ is simply the unbiased estimator of $\sigma^2$. Because $\hat{\sigma}$, the standard error of the regression, is
always reported, obtaining predicted values for y is easy. Because $\hat{\sigma}^2 > 0$, $\exp(\hat{\sigma}^2/2) > 1$. For large
$\hat{\sigma}^2$, this adjustment factor can be substantially larger than unity.
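
The fact used above, that the expected value of $\exp(u)$ is $\exp(\sigma^2/2)$ when u is normal with mean zero and variance $\sigma^2$, follows from completing the square in the exponent of the normal density. A short derivation, included here only as a supporting sketch:

```latex
E[\exp(u)]
  = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi\sigma^2}}
      \exp\!\left( u - \frac{u^2}{2\sigma^2} \right) du
  = \exp\!\left( \frac{\sigma^2}{2} \right)
    \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi\sigma^2}}
      \exp\!\left( -\frac{(u - \sigma^2)^2}{2\sigma^2} \right) du
  = \exp\!\left( \frac{\sigma^2}{2} \right).
```

The last integral equals one because its integrand is the density of a normal random variable with mean $\sigma^2$ and variance $\sigma^2$.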

The prediction in (6.40) is not unbiased, but it is consistent. There are no unbiased predictions of 
y, and in many cases, (6.40) works well. However, it does rely on the normality of the error term, u. 
In Chapter 5, we showed that OLS has desirable properties, even when u is not normally distributed. 
Therefore, it is useful to have a prediction that does not rely on normality. If we just assume that u is 
independent of the explanatory variables, then we have 


$E(y \mid \mathbf{x}) = \alpha_0 \exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k),$ [6.41]


where $\alpha_0$ is the expected value of $\exp(u)$, which must be greater than unity.
Given an estimate $\hat{\alpha}_0$, we can predict y as


$\hat{y} = \hat{\alpha}_0 \exp(\widehat{logy}),$ [6.42]


which again simply requires exponentiating the predicted value from the log model and multiplying
the result by $\hat{\alpha}_0$.

Two approaches suggest themselves for estimating $\alpha_0$ without the normality assumption. The first is
based on $\alpha_0 = E[\exp(u)]$. To estimate $\alpha_0$ we replace the population expectation with a sample average and
then we replace the unobserved errors, $u_i$, with the OLS residuals, $\hat{u}_i = \log(y_i) - \hat{\beta}_0 - \hat{\beta}_1 x_{i1} - \cdots - \hat{\beta}_k x_{ik}$.
This leads to the method of moments estimator (see Math Refresher C)


$\hat{\alpha}_0 = n^{-1} \sum_{i=1}^{n} \exp(\hat{u}_i).$ [6.43]


Not surprisingly, $\hat{\alpha}_0$ is a consistent estimator of $\alpha_0$, but it is not unbiased because we have replaced
$u_i$ with $\hat{u}_i$ inside a nonlinear function. This version of $\hat{\alpha}_0$ is a special case of what Duan (1983) called
a smearing estimate. Because the OLS residuals have a zero sample average, it can be shown that,
for any data set, $\hat{\alpha}_0 > 1$. (Technically, $\hat{\alpha}_0$ would equal one if all the OLS residuals were zero, but this
never happens in any interesting application.) That $\hat{\alpha}_0$ is necessarily greater than one is convenient
because it must be that $\alpha_0 > 1$.

A different estimate of $\alpha_0$ is based on a simple regression through the origin. To see how it works,
define $m_i = \exp(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik})$, so that, from equation (6.41), $E(y_i \mid m_i) = \alpha_0 m_i$. If we
could observe the $m_i$, we could obtain an unbiased estimator of $\alpha_0$ from the regression $y_i$ on $m_i$ without
an intercept. Instead, we replace the $\beta_j$ with their OLS estimates and obtain $\hat{m}_i = \exp(\widehat{logy}_i)$, where,
of course, the $\widehat{logy}_i$ are the fitted values from the regression $logy_i$ on $x_{i1}, \ldots, x_{ik}$ (with an intercept).
Then $\check{\alpha}_0$ [to distinguish it from $\hat{\alpha}_0$ in equation (6.43)] is the OLS slope estimate from the simple
regression $y_i$ on $\hat{m}_i$ (no intercept):


$\check{\alpha}_0 = \left( \sum_{i=1}^{n} \hat{m}_i^2 \right)^{-1} \left( \sum_{i=1}^{n} \hat{m}_i y_i \right).$ [6.44]


We will call $\check{\alpha}_0$ the regression estimate of $\alpha_0$. Like $\hat{\alpha}_0$, $\check{\alpha}_0$ is consistent but not unbiased. Interestingly,
$\check{\alpha}_0$ is not guaranteed to be greater than one, although it will be in most applications. If $\check{\alpha}_0$ is less than
one, and especially if it is much less than one, it is likely that the assumption of independence between
u and the $x_j$ is violated. If $\check{\alpha}_0 < 1$, one possibility is to just use the estimate in (6.43), although this
may simply be masking a problem with the linear model for log(y).

We summarize the steps: 


6-4d Predicting y When the Dependent Variable Is log(y) 


1. Obtain the fitted values, $\widehat{logy}_i$, and residuals, $\hat{u}_i$, from the regression logy on $x_1, \ldots, x_k$.
2. Obtain $\hat{\alpha}_0$ as in equation (6.43) or $\check{\alpha}_0$ in equation (6.44).
3. For given values of $x_1, \ldots, x_k$, obtain $\widehat{logy}$ from (6.39).
4. Obtain the prediction $\hat{y}$ from (6.42) (with $\hat{\alpha}_0$ or $\check{\alpha}_0$).
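
The four steps can be sketched in code. The example below uses simulated data in which the error is independent of the regressor, so both the smearing estimate (6.43) and the regression estimate (6.44) should be close to the true value of $\alpha_0$, which is exp(.125) in this invented design. All function names are illustrative.

```python
import numpy as np

def fit_log_model(y, X):
    """Step 1: regress log(y) on X (X includes a column of ones)."""
    logy = np.log(y)
    beta, *_ = np.linalg.lstsq(X, logy, rcond=None)
    resid = logy - X @ beta
    return beta, resid

def smearing_estimate(resid):
    """Equation (6.43): sample average of exp(residual)."""
    return float(np.mean(np.exp(resid)))

def regression_estimate(y, m_hat):
    """Equation (6.44): slope from regressing y on m_hat through the origin."""
    return float((m_hat @ y) / (m_hat @ m_hat))

rng = np.random.default_rng(4)
n = 1_000
x = rng.normal(0.0, 1.0, n)
u = rng.normal(0.0, 0.5, n)       # independent of x, variance 0.25
y = np.exp(1.0 + 0.3 * x + u)     # so log(y) = 1 + 0.3 x + u

X = np.column_stack([np.ones(n), x])
beta, resid = fit_log_model(y, X)
m_hat = np.exp(X @ beta)          # exponentiated fitted values of log(y)

# Step 2: the two estimates of alpha_0.
alpha_smear = smearing_estimate(resid)
alpha_reg = regression_estimate(y, m_hat)

# Steps 3 and 4: predict y at x = 0.5.
x_new = np.array([1.0, 0.5])
naive = float(np.exp(x_new @ beta))   # exp of the predicted log, biased downward
pred_smear = alpha_smear * naive
pred_reg = alpha_reg * naive

print(f"smearing estimate: {alpha_smear:.3f}, regression estimate: {alpha_reg:.3f}")
print(f"naive: {naive:.3f}, smearing: {pred_smear:.3f}, regression: {pred_reg:.3f}")
```

Because the residuals from a regression with an intercept average to zero, the smearing estimate exceeds one, so the adjusted prediction always exceeds the naive one.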


We now show how to predict CEO salaries using this procedure. 


EXAMPLE 6.7 Predicting CEO Salaries


The model of interest is 
$\log(salary) = \beta_0 + \beta_1 \log(sales) + \beta_2 \log(mktval) + \beta_3 ceoten + u,$


so that $\beta_1$ and $\beta_2$ are elasticities and $100 \cdot \beta_3$ is a semi-elasticity. The estimated equation using
CEOSAL2 is


$\widehat{lsalary} = 4.504 + .163\, lsales + .109\, lmktval + .0117\, ceoten$ [6.45]
           (.257)   (.039)        (.050)          (.0053)
n = 177, $R^2 = .318$,
where, for clarity, we let lsalary denote the log of salary, and similarly for lsales and lmktval. Next,
we obtain $\hat{m}_i = \exp(\widehat{lsalary}_i)$ for each observation in the sample.

The Duan smearing estimate from (6.43) is about $\hat{\alpha}_0 = 1.136$, and the regression estimate from
(6.44) is $\check{\alpha}_0 = 1.117$. We can use either estimate to predict salary for any values of sales, mktval, and
ceoten. Let us find the prediction for sales = 5,000 (which means $5 billion because sales is in millions),
mktval = 10,000 (or $10 billion), and ceoten = 10. From (6.45), the prediction for lsalary is
4.504 + .163·log(5,000) + .109·log(10,000) + .0117(10) = 7.013, and exp(7.013) = 1,110.983.
Using the estimate from (6.43), the predicted salary is about 1,262.077, or $1,262,077. Using
the estimate from (6.44) gives an estimated salary of about $1,240,968. These differ from each other
by much less than each differs from the naive prediction of $1,110,983.

We can use the previous method of obtaining predictions to determine how well the model with
log(y) as the dependent variable explains y. We already have measures for models when y is the 
dependent variable: the R-squared and the adjusted R-squared. The goal is to find a goodness-of-fit 
measure in the log(y) model that can be compared with an R-squared from a model where y is the 
dependent variable. 

There are different ways to define a goodness-of-fit measure after retransforming a model for 
log(y) to predict y. Here we present two approaches that are easy to implement. The first gives the 
same goodness-of-fit measures whether we estimate $\alpha_0$ as in (6.40), (6.43), or (6.44). To motivate the
measure, recall that in the linear regression equation estimated by OLS,


$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \cdots + \hat{\beta}_k x_k,$ [6.46]


the usual R-squared is simply the square of the correlation between $y_i$ and $\hat{y}_i$ (see Section 3-2). Now, if
instead we compute fitted values from (6.42), that is, $\hat{y}_i = \hat{\alpha}_0 \hat{m}_i$ for all observations i, then it makes
sense to use the square of the correlation between $y_i$ and these fitted values as an R-squared. Because
the squared correlation is unaffected if we multiply by a (nonzero) constant, it does not matter which estimate of $\alpha_0$ we use.
In fact, this R-squared measure for y [not log(y)] is just the squared correlation between $y_i$ and $\hat{m}_i$. We
can compare this directly with the R-squared from equation (6.46).

The squared correlation measure does not depend on how we estimate $\alpha_0$. A second approach is
to compute an R-squared for y based on a sum of squared residuals. For concreteness, suppose we use
equation (6.43) to estimate $\alpha_0$. Then the residual for predicting $y_i$ is


$\hat{r}_i = y_i - \hat{\alpha}_0 \exp(\widehat{logy}_i)$, [6.47]


and we can use these residuals to compute a sum of squared residuals. Using the formula for R-squared
from linear regression, we are led to


$1 - \dfrac{\sum_{i=1}^{n} \hat{r}_i^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$ [6.48]


as an alternative goodness-of-fit measure that can be compared with the R-squared from the linear
model for y. Notice that we can compute such a measure for the alternative estimates of $\alpha_0$ in equations
(6.40) and (6.44) by inserting those estimates in place of $\hat{\alpha}_0$ in (6.47). Unlike the squared correlation
between $y_i$ and $\hat{m}_i$, the R-squared in (6.48) will depend on how we estimate $\alpha_0$. The estimate that
minimizes $\sum_{i=1}^{n} \hat{r}_i^2$ is that in equation (6.44), but that does not mean we should prefer it (and certainly not if
$\check{\alpha}_0 < 1$). We are not really trying to choose among the different estimates of $\alpha_0$; rather, we are finding
goodness-of-fit measures that can be compared with the linear model for y.
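
The two measures just described can be sketched in a few lines of Python. This is only an illustration on simulated data, not the CEO salary data: the data-generating values, the variable names, and the helper functions `mean`, `corr`, and `r2_ssr` are all hypothetical. We also assume that (6.43) is the smearing estimate (the average of the exponentiated log residuals) and that (6.44) is the slope from regressing $y_i$ on $\hat{m}_i$ through the origin.

```python
import math
import random


def mean(values):
    return sum(values) / len(values)


def corr(a, b):
    """Sample correlation between two equal-length lists."""
    abar, bbar = mean(a), mean(b)
    sab = sum((ai - abar) * (bi - bbar) for ai, bi in zip(a, b))
    saa = sum((ai - abar) ** 2 for ai in a)
    sbb = sum((bi - bbar) ** 2 for bi in b)
    return sab / math.sqrt(saa * sbb)


# Simulated data (hypothetical): log(y) is linear in a single regressor x.
random.seed(0)
n = 500
x = [random.uniform(0, 4) for _ in range(n)]
y = [math.exp(1.0 + 0.5 * xi + random.gauss(0, 0.5)) for xi in x]

# OLS of log(y) on x, using the simple regression formulas.
logy = [math.log(yi) for yi in y]
xbar, lbar = mean(x), mean(logy)
b1 = sum((xi - xbar) * (li - lbar) for xi, li in zip(x, logy)) / sum(
    (xi - xbar) ** 2 for xi in x
)
b0 = lbar - b1 * xbar
m_hat = [math.exp(b0 + b1 * xi) for xi in x]  # exp of the fitted log(y)

# First measure: squared correlation between y_i and m_hat_i.
r2_corr = corr(y, m_hat) ** 2


def r2_ssr(alpha0):
    """SSR-based R-squared as in (6.48), with residuals y_i - alpha0 * m_hat_i."""
    ybar = mean(y)
    ssr = sum((yi - alpha0 * mi) ** 2 for yi, mi in zip(y, m_hat))
    sst = sum((yi - ybar) ** 2 for yi in y)
    return 1 - ssr / sst


# Two estimates of alpha_0 (assumed forms of (6.43) and (6.44)).
alpha_smear = mean([math.exp(li - math.log(mi)) for li, mi in zip(logy, m_hat)])
alpha_origin = sum(mi * yi for mi, yi in zip(m_hat, y)) / sum(mi ** 2 for mi in m_hat)

print(round(r2_corr, 3), round(r2_ssr(alpha_smear), 3), round(r2_ssr(alpha_origin), 3))
```

In this sketch the squared correlation is the same for any positive multiple of $\hat{m}_i$, whereas the second measure changes with the estimate of $\alpha_0$ and is largest for the regression-through-the-origin estimate, as the text indicates.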


EXAMPLE 6.8 Predicting CEO Salaries 


After we obtain the $\hat{m}_i$, we just obtain the correlation between $salary_i$ and $\hat{m}_i$; it is .493. The square of
it is about .243, and this is a measure of how well the log model explains the variation in salary, not
log(salary). [The $R^2$ from (6.45), .318, tells us that the log model explains about 31.8% of the
variation in log(salary).]

As a competing linear model, suppose we estimate a model with all variables in levels: 


$salary = \beta_0 + \beta_1 sales + \beta_2 mktval + \beta_3 ceoten + u$. [6.49]


The key is that the dependent variable is salary. We could use logs of sales or mktval on the 
right-hand side, but it makes more sense to have all dollar values in levels if one (salary) appears as 
a level. The R-squared from estimating this equation using the same 177 observations is .201. Thus, 
the log model explains more of the variation in salary, and so we prefer it to (6.49) on goodness-of-fit
grounds. The log model is also preferred because it seems more realistic and its parameters are easier
to interpret. 

If we maintain the full set of classical linear model assumptions in the model (6.38), we can easily
obtain prediction intervals for $y^0 = \exp(\beta_0 + \beta_1 x_1^0 + \cdots + \beta_k x_k^0 + u^0)$ when we have estimated a
linear model for log(y). Recall that $x_1^0, x_2^0, \ldots, x_k^0$ are known values and $u^0$ is the unobserved error that
partly determines $y^0$. From equation (6.37), a 95% prediction interval for $logy^0 = \log(y^0)$ is simply
$\widehat{logy}^0 \pm t_{.025} \cdot se(\hat{e}^0)$, where $se(\hat{e}^0)$ is obtained from the regression of log(y) on $x_1, \ldots, x_k$ using the
original n observations. Let $c_l = \widehat{logy}^0 - t_{.025} \cdot se(\hat{e}^0)$ and $c_u = \widehat{logy}^0 + t_{.025} \cdot se(\hat{e}^0)$ be the lower and
upper bounds of the prediction interval for $logy^0$. That is, $P(c_l \le logy^0 \le c_u) = .95$. Because the
exponential function is strictly increasing, it is also true that $P[\exp(c_l) \le \exp(logy^0) \le \exp(c_u)] = .95$,
that is, $P[\exp(c_l) \le y^0 \le \exp(c_u)] = .95$. Therefore, we can take $\exp(c_l)$ and $\exp(c_u)$ as the lower and
upper bounds, respectively, for a 95% prediction interval for $y^0$. For large n, $t_{.025} = 1.96$, and so a 95%
prediction interval for $y^0$ is $\exp[-1.96 \cdot se(\hat{e}^0)]\exp(\hat{\beta}_0 + x^0\hat{\beta})$ to $\exp[1.96 \cdot se(\hat{e}^0)]\exp(\hat{\beta}_0 + x^0\hat{\beta})$,
where $x^0\hat{\beta}$ is shorthand for $\hat{\beta}_1 x_1^0 + \cdots + \hat{\beta}_k x_k^0$. Remember, the $\hat{\beta}_j$ and $se(\hat{e}^0)$ are obtained from the
regression with log(y) as the dependent variable. Because we assume normality of u in (6.38), we
probably would use (6.40) to obtain a point prediction for $y^0$. Unlike in equation (6.37), this point
prediction will not lie halfway between the lower and upper bounds $\exp(c_l)$ and $\exp(c_u)$. One can obtain
different 95% prediction intervals by choosing different quantiles in the $t_{n-k-1}$ distribution. If $q_{\alpha_1}$
and $q_{\alpha_2}$ are quantiles with $\alpha_2 - \alpha_1 = .95$, then we can choose $c_l = \widehat{logy}^0 + q_{\alpha_1} se(\hat{e}^0)$ and
$c_u = \widehat{logy}^0 + q_{\alpha_2} se(\hat{e}^0)$.

As an example, consider the CEO salary regression, where we make the prediction at the
same values of sales, mktval, and ceoten as in Example 6.7. The standard error of the regression
for (6.43) is about .505, and the standard error of $\widehat{logy}^0$ is about .075. Therefore, using equation
(6.36), $se(\hat{e}^0) \approx .511$; as in the GPA example, the error variance swamps the estimation error in the
parameters, even though here the sample size is only 177. A 95% prediction interval for $salary^0$ is
$\exp[-1.96 \cdot (.511)]\exp(7.013)$ to $\exp[1.96 \cdot (.511)]\exp(7.013)$, or about 408.071 to 3,024.678,
that is, $408,071 to $3,024,678. This very wide 95% prediction interval for CEO salary at the given
sales, market value, and tenure values shows that there is much else that we have not included in the
regression that determines salary. Incidentally, the point prediction for salary, using (6.40), is about
$1,262,075, which is higher than the predictions using the other estimates of $\alpha_0$ and closer to the lower bound
than the upper bound of the 95% prediction interval.
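
The arithmetic in this example is easy to reproduce. The following sketch uses only the rounded values reported in the text (7.013 for the predicted log salary, .511 for $se(\hat{e}^0)$, and .505 for the standard error of the regression), so its output agrees with the reported figures only up to rounding. The function name is hypothetical, and the point prediction assumes that (6.40) scales $\exp(\widehat{logy}^0)$ by $\exp(\hat{\sigma}^2/2)$.

```python
import math


def log_model_prediction_interval(logy_hat, se_e, crit=1.96):
    """Exponentiate the bounds of the prediction interval for log(y)."""
    c_l = logy_hat - crit * se_e
    c_u = logy_hat + crit * se_e
    return math.exp(c_l), math.exp(c_u)


# Rounded inputs taken from the CEO salary example (salary in thousands).
lower, upper = log_model_prediction_interval(7.013, 0.511)

# Point prediction under normality, assuming the form exp(sigma^2 / 2) * exp(logy_hat).
point = math.exp(0.505 ** 2 / 2) * math.exp(7.013)

print(round(lower, 1), round(point, 1), round(upper, 1))
```

Because the interval is symmetric on the log scale, exponentiating stretches the upper half, which is why the point prediction sits closer to the lower bound.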


Summary 


In this chapter, we have covered some important multiple regression analysis topics. 

Section 6-1 showed that a change in the units of measurement of an independent variable changes 
the OLS coefficient in the expected manner: if $x_j$ is multiplied by c, its coefficient is divided by c. If the
dependent variable is multiplied by c, all OLS coefficients are multiplied by c. Neither t nor F statistics are 
affected by changing the units of measurement of any variables. 
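
A quick numerical check of these scaling results is sketched below for a simple regression. The data are simulated, and the helper `simple_ols` and all numeric values are hypothetical; the same algebra carries over to multiple regression.

```python
import math
import random


def simple_ols(x, y):
    """Return (intercept, slope, t statistic of the slope) for OLS of y on x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    intercept = ybar - slope * xbar
    ssr = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se_slope = math.sqrt(ssr / (n - 2) / sxx)
    return intercept, slope, slope / se_slope


random.seed(42)
x = [random.uniform(10, 100) for _ in range(200)]
y = [3.0 + 0.8 * xi + random.gauss(0, 5) for xi in x]

c = 1000.0
base = simple_ols(x, y)
x_scaled = simple_ols([c * xi for xi in x], y)  # x measured in new units
y_scaled = simple_ols(x, [c * yi for yi in y])  # y measured in new units

print(base[1], x_scaled[1] * c, y_scaled[1] / c)  # the same slope three ways
print(base[2], x_scaled[2], y_scaled[2])          # identical t statistics
```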

We discussed beta coefficients, which measure the effects of the independent variables on the depen- 
dent variable in standard deviation units. The beta coefficients are obtained from a standard OLS regression 
after the dependent and independent variables have been transformed into z-scores. 
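
As a small illustration, a beta coefficient can be obtained either by running the regression on z-scores or by rescaling the usual OLS slope by the ratio of sample standard deviations. The sketch below checks that the two routes agree in a simple regression with simulated data; the helper names and numeric values are hypothetical.

```python
import math
import random


def mean(v):
    return sum(v) / len(v)


def sd(v):
    """Sample standard deviation (n - 1 in the denominator)."""
    m = mean(v)
    return math.sqrt(sum((vi - m) ** 2 for vi in v) / (len(v) - 1))


def slope(x, y):
    """OLS slope from the simple regression of y on x (with an intercept)."""
    xbar, ybar = mean(x), mean(y)
    return sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sum(
        (xi - xbar) ** 2 for xi in x
    )


def zscores(v):
    m, s = mean(v), sd(v)
    return [(vi - m) / s for vi in v]


random.seed(7)
x = [random.gauss(50, 12) for _ in range(300)]
y = [20 + 0.4 * xi + random.gauss(0, 8) for xi in x]

b1 = slope(x, y)                              # effect in original units
beta1_direct = slope(zscores(x), zscores(y))  # regression on z-scores
beta1_formula = b1 * sd(x) / sd(y)            # rescaled OLS slope

print(round(beta1_direct, 4), round(beta1_formula, 4))
```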

We provided a detailed discussion of functional form, including the logarithmic transformation, qua- 
dratics, and interaction terms. It is helpful to summarize some of our conclusions. 


CONSIDERATIONS WHEN USING LOGARITHMS 


1. The coefficients have percentage change interpretations. We can be ignorant of the units of
measurement of any variable that appears in logarithmic form, and changing units from, say, dollars to
thousands of dollars has no effect on a variable's coefficient when that variable appears in logarithmic
form.

2. Logs are often used for dollar amounts that are always positive, as well as for variables such as
population, especially when there is a lot of variation. They are used less often for variables measured in
years, such as schooling, age, and experience. Logs are used infrequently for variables that are already
percents or proportions, such as an unemployment rate or a pass rate on a test.

3. Models with log(y) as the dependent variable often more closely satisfy the classical linear model
assumptions. For example, the model has a better chance of being linear, homoskedasticity is more
likely to hold, and normality is often more plausible.

4. In many cases, taking the log greatly reduces the variation of a variable, making OLS estimates less
prone to outlier influence. However, in cases where y is a fraction and close to zero for many
observations, $\log(y_i)$ can have much more variability than $y_i$. For values $y_i$ very close to zero, $\log(y_i)$ is a
negative number very large in magnitude.

5. If $y \ge 0$ but $y = 0$ is possible, we cannot use log(y). Sometimes log(1 + y) is used, but interpretation
of the coefficients is difficult.

6. For large changes in an explanatory variable, we can compute a more accurate estimate of the
percentage change effect.

7. It is harder (but possible) to predict y when we have estimated a model for log(y).


CONSIDERATIONS WHEN USING QUADRATICS 


1. A quadratic function in an explanatory variable allows for an increasing or decreasing effect.

2. The turning point of a quadratic is easily calculated, and it should be calculated to see if it makes sense.
Quadratic functions where the coefficients have the opposite sign have a strictly positive turning point;
if the signs of the coefficients are the same, the turning point is at a negative value of x.

3. A seemingly small coefficient on the square of a variable can be practically important in what it
implies about a changing slope. One can use a t test to see if the quadratic is statistically significant,
and compute the slope at various values of x to see if it is practically important.

4. For a model quadratic in a variable x, the coefficient on x measures the partial effect starting from x = 0,
as can be seen in equation (6.11). If zero is not a possible or interesting value of x, one can center x
about a more interesting value, such as the average in the sample, before computing the square. This is
the same as computing the average partial effect. Computing Exercise C12 provides an example.
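
The turning point and slope calculations mentioned in this list amount to two one-line formulas. In the sketch below, the fitted equation $\hat{y} = b_0 + b_1 x + b_2 x^2$ and the coefficient values are hypothetical, chosen only to illustrate the sign pattern described above.

```python
import math


def turning_point(b1, b2):
    """Value of x at which the slope b1 + 2*b2*x of the quadratic equals zero."""
    return -b1 / (2 * b2)


def slope_at(b1, b2, x):
    """Approximate partial effect of x on y at a given value of x."""
    return b1 + 2 * b2 * x


# Hypothetical estimates with opposite signs: positive turning point.
b1, b2 = 0.30, -0.006
x_star = turning_point(b1, b2)

for x in (0, 10, 25, 40):
    print(x, round(slope_at(b1, b2, x), 3))  # slope shrinks, then turns negative

# Hypothetical estimates with the same sign: turning point at a negative x.
x_star_same_sign = turning_point(0.30, 0.006)
```

Even though the coefficient on the squared term looks small here, it moves the slope from .30 at x = 0 to zero at the turning point, which is the sense in which a small quadratic coefficient can matter in practice.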


CONSIDERATIONS WHEN USING INTERACTIONS 


1. Interaction terms allow the partial effect of an explanatory variable, say $x_1$, to depend on the level of
another variable, say $x_2$, and vice versa.

2. Interpreting models with interactions can be tricky. The coefficient on $x_1$, say $\beta_1$, measures the partial
effect of $x_1$ on y when $x_2 = 0$, which may be impossible or uninteresting. Centering $x_1$ and $x_2$ around
interesting values before constructing the interaction term typically leads to an equation that is visually
more appealing. When the variables are centered about their sample averages before multiplying them
together to create the interaction, the coefficients on the levels become estimated average partial effects.
A standard t test can be used to determine if an interaction term is statistically significant. Computing
the partial effects at different values of the explanatory variables can be used to determine the practical
importance of interactions.
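
The effect of centering before forming the interaction can be verified numerically. The sketch below uses simulated data and a small, hypothetical `ols` helper that solves the normal equations directly (a statistical package would normally be used instead). It checks that the coefficient on $x_1$ in the centered regression equals $\hat{\beta}_1 + \hat{\beta}_3 \bar{x}_2$ from the uncentered regression, where $\hat{\beta}_3$ denotes the interaction coefficient.

```python
import random


def ols(columns, y):
    """OLS with an intercept; columns is a list of regressor lists."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    # Augmented normal equations [X'X | X'y].
    A = [
        [sum(X[i][r] * X[i][c] for i in range(n)) for c in range(k)]
        + [sum(X[i][r] * y[i] for i in range(n))]
        for r in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for p in range(k):
        piv = max(range(p, k), key=lambda r: abs(A[r][p]))
        A[p], A[piv] = A[piv], A[p]
        for r in range(k):
            if r != p:
                f = A[r][p] / A[p][p]
                A[r] = [a - f * b for a, b in zip(A[r], A[p])]
    return [A[r][k] / A[r][r] for r in range(k)]


random.seed(1)
n = 400
x1 = [random.gauss(5, 2) for _ in range(n)]
x2 = [random.gauss(10, 3) for _ in range(n)]
y = [1 + 0.5 * a + 0.2 * b + 0.3 * a * b + random.gauss(0, 1) for a, b in zip(x1, x2)]

# Uncentered interaction: the coefficient on x1 is the effect at x2 = 0.
b = ols([x1, x2, [a * c for a, c in zip(x1, x2)]], y)

# Centered interaction: the coefficient on x1 is the effect at the mean of x2.
m1, m2 = sum(x1) / n, sum(x2) / n
g = ols([x1, x2, [(a - m1) * (c - m2) for a, c in zip(x1, x2)]], y)

ape_x1 = b[1] + b[3] * m2  # partial effect of x1 evaluated at the mean of x2
print(round(b[1], 3), round(g[1], 3), round(ape_x1, 3))
```

The interaction coefficient itself is unchanged by centering; only the coefficients on the levels (and the intercept) are re-expressed.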


We introduced the adjusted R-squared, $\bar{R}^2$, as an alternative to the usual R-squared for measuring
goodness-of-fit. Whereas $R^2$ can never fall when another variable is added to a regression, $\bar{R}^2$ penalizes the
number of regressors and can drop when an independent variable is added. This makes $\bar{R}^2$ preferable for choosing
between nonnested models with different numbers of explanatory variables. Neither $R^2$ nor $\bar{R}^2$ can be used to
compare models with different dependent variables. Nevertheless, it is fairly easy to obtain goodness-of-fit
measures for choosing between y and log(y) as the dependent variable, as shown in Section 6-4.

In Section 6-3, we discussed the somewhat subtle problem of relying too much on $R^2$ or $\bar{R}^2$ in arriving
at a final model: it is possible to control for too many factors in a regression model. For this reason, it is
important to think ahead about model specification, particularly the ceteris paribus nature of the multiple
regression equation. Explanatory variables that affect y and are uncorrelated with all the other explanatory
variables can be used to reduce the error variance without inducing multicollinearity.
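
To make the penalty in the adjusted R-squared concrete, the sketch below computes it from $R^2$, the sample size n, and the number of slope parameters k, using the standard formula $\bar{R}^2 = 1 - (1 - R^2)(n - 1)/(n - k - 1)$. The function name is hypothetical. The first call uses the values reported for the first equation in Problem 7 of this chapter; the last two use made-up numbers to show that $\bar{R}^2$ can fall when a regressor that adds almost nothing is included.

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R-squared for a regression with n observations and k slopes."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# Reported in Problem 7: R-squared = .100 with n = 1,534 and three regressors.
adj_401k = adjusted_r_squared(0.100, 1534, 3)

# Hypothetical case: R-squared barely rises when a fourth regressor is added.
adj_before = adjusted_r_squared(0.3000, 50, 3)
adj_after = adjusted_r_squared(0.3001, 50, 4)

print(round(adj_401k, 3), round(adj_before, 3), round(adj_after, 3))
```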

In Section 6-4, we demonstrated how to obtain a confidence interval for a prediction made from an 
OLS regression line. We also showed how a confidence interval can be constructed for a future, unknown 
value of y. 

Occasionally, we want to predict y when log(y) is used as the dependent variable in a regression
model. Section 6-4 explains this simple method. Finally, we are sometimes interested in knowing about the
sign and magnitude of the residuals for particular observations. Residual analysis can be used to determine
whether particular members of the sample have predicted values that are well above or well below the
actual outcomes.


Key Terms 


Adjusted R-Squared Nonnested Models Quadratic Functions 

Average Partial Effect (APE) Over Controlling Resampling Method 

Beta Coefficients Population R-Squared Residual Analysis 

Bootstrap Prediction Error Smearing Estimate 

Bootstrap Standard Error Prediction Interval Standardized Coefficients 
Interaction Effect Predictions Variance of the Prediction Error 


Problems 


1 The following equation was estimated using the data in CEOSAL1: 


$\widehat{\log(salary)} = 4.322 + .276\,\log(sales) + .0215\,roe - .00008\,roe^2$
                 (.324)   (.033)              (.0129)       (.00026)
$n = 209,\ R^2 = .282.$


This equation allows roe to have a diminishing effect on log(salary). Is this generality necessary? 
Explain why or why not. 


2 Let $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_k$ be the OLS estimates from the regression of $y_i$ on $x_{i1}, \ldots, x_{ik}$, $i = 1, 2, \ldots, n$. For
nonzero constants $c_1, \ldots, c_k$, argue that the OLS intercept and slopes from the regression of $c_0 y_i$ on
$c_1 x_{i1}, \ldots, c_k x_{ik}$, $i = 1, 2, \ldots, n$, are given by $\tilde{\beta}_0 = c_0\hat{\beta}_0$, $\tilde{\beta}_1 = (c_0/c_1)\hat{\beta}_1, \ldots, \tilde{\beta}_k = (c_0/c_k)\hat{\beta}_k$. [Hint:
Use the fact that the $\hat{\beta}_j$ solve the first order conditions in (3.13), and the $\tilde{\beta}_j$ must solve the first order
conditions involving the rescaled dependent and independent variables.]


3 Using the data in RDCHEM, the following equation was obtained by OLS: 
$\widehat{rdintens} = 2.613 + .00030\,sales - .0000000070\,sales^2$
             (.429)   (.00014)        (.0000000037)
$n = 32,\ R^2 = .1484.$


(i) At what point does the marginal effect of sales on rdintens become negative?
(ii) Would you keep the quadratic term in the model? Explain.
(iii) Define salesbil as sales measured in billions of dollars: salesbil = sales/1,000. Rewrite the
estimated equation with salesbil and $salesbil^2$ as the independent variables. Be sure to report
standard errors and the R-squared. [Hint: Note that $salesbil^2 = sales^2/(1{,}000)^2$.]

(iv) For the purpose of reporting the results, which equation do you prefer? 


4 The following model allows the return to education to depend upon the total amount of both parents’ 
education, called pareduc: 


$\log(wage) = \beta_0 + \beta_1 educ + \beta_2 educ \cdot pareduc + \beta_3 exper + \beta_4 tenure + u$.

(i) Show that, in decimal form, the return to another year of education in this model is

$\Delta\log(wage)/\Delta educ = \beta_1 + \beta_2 pareduc$.

What sign do you expect for $\beta_2$? Why?
(ii) Using the data in WAGE2, the estimated equation is


$\widehat{\log(wage)} = 5.65 + .047\,educ + .00078\,educ \cdot pareduc + .019\,exper + .010\,tenure$
               (.13)   (.010)        (.00021)                 (.004)         (.003)
$n = 722,\ R^2 = .169.$


(Only 722 observations contain full information on parents' education.) Interpret the coefficient
on the interaction term. It might help to choose two specific values for pareduc (for example,
pareduc = 32 if both parents have a college education, or pareduc = 24 if both parents have a
high school education) and to compare the estimated return to educ.

(iii) When pareduc is added as a separate variable to the equation, we get:


$\widehat{\log(wage)} = 4.94 + .097\,educ + .033\,pareduc - .0016\,educ \cdot pareduc + .020\,exper + .010\,tenure$
               (.38)   (.027)        (.017)           (.0012)                 (.004)         (.003)
$n = 722,\ R^2 = .174.$


Does the estimated return to education now depend positively on parent education? Test the null 
hypothesis that the return to education does not depend on parent education. 


5 In Example 4.2, where the percentage of students receiving a passing score on a tenth-grade math
exam (math10) is the dependent variable, does it make sense to include sci11 (the percentage of
eleventh graders passing a science exam) as an additional explanatory variable?


6 When $atndrte^2$ and $ACT \cdot atndrte$ are added to the equation estimated in (6.19), the R-squared becomes
.232. Are these additional terms jointly significant at the 10% level? Would you include them in the 
model? 


7 The following three equations were estimated using the 1,534 observations in 401K: 


$\widehat{prate} = 80.29 + 5.44\,mrate + .269\,age - .00013\,totemp$
          (.78)    (.52)          (.045)       (.00004)
$R^2 = .100,\ \bar{R}^2 = .098.$


$\widehat{prate} = 97.32 + 5.02\,mrate + .314\,age - 2.66\,\log(totemp)$
          (1.95)   (0.51)         (.044)       (.28)
$R^2 = .144,\ \bar{R}^2 = .142.$


$\widehat{prate} = 80.62 + 5.34\,mrate + .290\,age - .00043\,totemp + .0000000039\,totemp^2$
          (.78)    (.52)          (.045)       (.00009)          (.00000000010)
$R^2 = .108,\ \bar{R}^2 = .106.$


Which of these three models do you prefer? Why?


8 Suppose we want to estimate the effects of alcohol consumption (alcohol) on college grade point
average (colGPA). In addition to collecting information on grade point averages and alcohol usage, we also
obtain attendance information (say, percentage of lectures attended, called attend). A standardized test
score (say, SAT) and high school GPA (hsGPA) are also available.

(i) Should we include attend along with alcohol as explanatory variables in a multiple regression
model? (Think about how you would interpret $\beta_{alcohol}$.)

(ii) Should SAT and hsGPA be included as explanatory variables? Explain.


9 If we start with (6.38) under the CLM assumptions, assume large n, and ignore the estimation error
in the $\hat{\beta}_j$, a 95% prediction interval for $y^0$ is $[\exp(-1.96\hat{\sigma})\exp(\widehat{logy}^0),\ \exp(1.96\hat{\sigma})\exp(\widehat{logy}^0)]$.
The point prediction for $y^0$ is $\hat{y}^0 = \exp(\hat{\sigma}^2/2)\exp(\widehat{logy}^0)$.

(i) For what values of $\hat{\sigma}$ will the point prediction be in the 95% prediction interval? Does this
condition seem likely to hold in most applications?

(ii) Verify that the condition from part (i) is satisfied in the CEO salary example.


10 The following two equations were estimated using the data in MEAPSINGLE. The key explanatory
variable is lexppp, the log of expenditures per student at the school level.


$\widehat{math4} = 24.49 + 9.01\,lexppp - .422\,free - .752\,lmedinc - .274\,pctsgle$
          (59.24)  (4.04)          (.071)        (5.358)          (.161)
$n = 229,\ R^2 = .472,\ \bar{R}^2 = .462.$


$\widehat{math4} = 149.38 + 1.93\,lexppp - .060\,free - 10.78\,lmedinc - .397\,pctsgle + .667\,read4$
          (41.70)   (2.82)          (.054)        (3.76)            (.111)           (.042)
$n = 229,\ R^2 = .749,\ \bar{R}^2 = .743.$


(i) If you are a policy maker trying to estimate the causal effect of per-student spending on math
test performance, explain why the first equation is more relevant than the second. What is the
estimated effect of a 10% increase in expenditures per student?

(ii) Does adding read4 to the regression have strange effects on coefficients and statistical
significance other than $\beta_{lexppp}$?

(iii) How would you explain to someone with only basic knowledge of regression why, in this case,
you prefer the equation with the smaller adjusted R-squared?


11 Consider the equation


$y = \beta_0 + \beta_1 x + \beta_2 x^2 + u$
$E(u|x) = 0$,


where the explanatory variable x has a standard normal distribution in the population. In particular,
$E(x) = 0$, $E(x^2) = Var(x) = 1$, and $E(x^3) = 0$. This last condition holds because the standard normal
distribution is symmetric about zero. We want to study what we can say about the OLS estimator if we
omit $x^2$ and compute the simple regression estimator of the intercept and slope.

(i) Show that we can write


$y = \alpha_0 + \beta_1 x + v$,


where $E(v) = 0$. In particular, find v and the new intercept, $\alpha_0$.

(ii) Show that $E(v|x)$ depends on x unless $\beta_2 = 0$.

(iii) Show that $Cov(x, v) = 0$.

(iv) If $\hat{\beta}_1$ is the slope coefficient from regressing $y_i$ on $x_i$, is $\hat{\beta}_1$ consistent for $\beta_1$? Is it unbiased?
Explain.

(v) Argue that being able to estimate $\beta_1$ has some value in the following sense: $\beta_1$ is the partial
effect of x on $E(y|x)$ evaluated at $x = 0$, the average value of x.

(vi) Explain why being able to consistently estimate $\beta_1$ and $\beta_2$ is more valuable than just estimating $\beta_1$.


Computer Exercises 


C1 Use the data in KIELMC, only for the year 1981, to answer the following questions. The data are for 
houses that sold during 1981 in North Andover, Massachusetts; 1981 was the year construction began 
on a local garbage incinerator. 


(i) To study the effects of the incinerator location on housing price, consider the simple regression
model


$\log(price) = \beta_0 + \beta_1 \log(dist) + u$,


where price is housing price in dollars and dist is distance from the house to the incinerator
measured in feet. Interpreting this equation causally, what sign do you expect for $\beta_1$ if the
presence of the incinerator depresses housing prices? Estimate this equation and interpret the
results.

(ii) To the simple regression model in part (i), add the variables log(intst), log(area), log(land),
rooms, baths, and age, where intst is distance from the home to the interstate, area is square
footage of the house, land is the lot size in square feet, rooms is total number of rooms, baths is
number of bathrooms, and age is age of the house in years. Now, what do you conclude about
the effects of the incinerator? Explain why (i) and (ii) give conflicting results.

(iii) Add $[\log(intst)]^2$ to the model from part (ii). Now what happens? What do you conclude about
the importance of functional form?

(iv) Is the square of log(dist) significant when you add it to the model from part (iii)?


C2 Use the data in WAGE1 for this exercise.

(i) Use OLS to estimate the equation

$\log(wage) = \beta_0 + \beta_1 educ + \beta_2 exper + \beta_3 exper^2 + u$

and report the results using the usual format.

(ii) Is $exper^2$ statistically significant at the 1% level?

(iii) Using the approximation


$\%\Delta\widehat{wage} \approx 100(\hat{\beta}_2 + 2\hat{\beta}_3 exper)\Delta exper$,


find the approximate return to the fifth year of experience. What is the approximate return to the
twentieth year of experience?

(iv) At what value of exper does additional experience actually lower predicted log(wage)? How
many people have more experience in this sample?


C3 Consider a model in which the return to education depends upon the amount of work experience (and
vice versa):


$\log(wage) = \beta_0 + \beta_1 educ + \beta_2 exper + \beta_3 educ \cdot exper + u$.


(i) Show that the return to another year of education (in decimal form), holding exper fixed, is
$\beta_1 + \beta_3 exper$.

(ii) State the null hypothesis that the return to education does not depend on the level of exper.
What do you think is the appropriate alternative?

(iii) Use the data in WAGE2 to test the null hypothesis in (ii) against your stated alternative.

(iv) Let $\theta_1$ denote the return to education (in decimal form), when exper = 10: $\theta_1 = \beta_1 + 10\beta_3$.
Obtain $\hat{\theta}_1$ and a 95% confidence interval for $\theta_1$. (Hint: Write $\beta_1 = \theta_1 - 10\beta_3$ and plug this
into the equation; then rearrange. This gives the regression for obtaining the confidence interval
for $\theta_1$.)


C4 Use the data in GPA2 for this exercise. 
(i) Estimate the model 


$sat = \beta_0 + \beta_1 hsize + \beta_2 hsize^2 + u$,


where hsize is the size of the graduating class (in hundreds), and write the results in the usual
form. Is the quadratic term statistically significant?

(ii) Using the estimated equation from part (i), what is the “optimal” high school size? Justify your
answer.

(iii) Is this analysis representative of the academic performance of all high school seniors? 
Explain. 

(iv) Find the estimated optimal high school size, using log(sat) as the dependent variable. Is it much 
different from what you obtained in part (ii)? 


C5 Use the housing price data in HPRICE1 for this exercise.
(i) Estimate the model


$\log(price) = \beta_0 + \beta_1 \log(lotsize) + \beta_2 \log(sqrft) + \beta_3 bdrms + u$


and report the results in the usual OLS format.

(ii) Find the predicted value of log(price), when lotsize = 20,000, sqrft = 2,500, and bdrms = 4.
Using the methods in Section 6-4, find the predicted value of price at the same values of the
explanatory variables.

(iii) For explaining variation in price, decide whether you prefer the model from part (i) or the
model


$price = \beta_0 + \beta_1 lotsize + \beta_2 sqrft + \beta_3 bdrms + u$.


C6 Use the data in VOTE1 for this exercise.
(i) Consider a model with an interaction between expenditures:


$voteA = \beta_0 + \beta_1 prtystrA + \beta_2 expendA + \beta_3 expendB + \beta_4 expendA \cdot expendB + u$.


What is the partial effect of expendB on voteA, holding prtystrA and expendA fixed? What is the
partial effect of expendA on voteA? Is the expected sign for $\beta_4$ obvious?

(ii) Estimate the equation in part (i) and report the results in the usual form. Is the interaction term
statistically significant?

(iii) Find the average of expendA in the sample. Fix expendA at 300 (for $300,000). What is the
estimated effect of another $100,000 spent by Candidate B on voteA? Is this a large effect?

(iv) Now fix expendB at 100. What is the estimated effect of $\Delta expendA = 100$ on voteA? Does this
make sense?

(v) Now, estimate a model that replaces the interaction with shareA, Candidate A's percentage share
of total campaign expenditures. Does it make sense to hold both expendA and expendB fixed,
while changing shareA?

(vi) (Requires calculus) In the model from part (v), find the partial effect of expendB on voteA,
holding prtystrA and expendA fixed. Evaluate this at expendA = 300 and expendB = 0 and
comment on the results.


C7 Use the data in ATTEND for this exercise.

(i) In the model of Example 6.3, argue that

$\Delta stndfnl/\Delta priGPA \approx \beta_2 + 2\beta_4 priGPA + \beta_6 atndrte$.


Use equation (6.19) to estimate the partial effect when priGPA = 2.59 and atndrte = 82.
Interpret your estimate.

(ii) Show that the equation can be written as


$stndfnl = \theta_0 + \beta_1 atndrte + \theta_2 priGPA + \beta_3 ACT + \beta_4 (priGPA - 2.59)^2$
$\qquad + \beta_5 ACT^2 + \beta_6 priGPA(atndrte - 82) + u$,


where $\theta_2 = \beta_2 + 2\beta_4(2.59) + \beta_6(82)$. (Note that the intercept has changed, but this is
unimportant.) Use this to obtain the standard error of $\hat{\theta}_2$ from part (i).

(iii) Suppose that, in place of $priGPA(atndrte - 82)$, you put $(priGPA - 2.59) \cdot (atndrte - 82)$.
Now how do you interpret the coefficients on atndrte and priGPA?


C8 Use the data in HPRICE1 for this exercise.

(i) Estimate the model

$price = \beta_0 + \beta_1 lotsize + \beta_2 sqrft + \beta_3 bdrms + u$


and report the results in the usual form, including the standard error of the regression. Obtain
predicted price, when we plug in lotsize = 10,000, sqrft = 2,300, and bdrms = 4; round this
price to the nearest dollar.

(ii) Run a regression that allows you to put a 95% confidence interval around the predicted value in
part (i). Note that your prediction will differ somewhat due to rounding error.

(iii) Let $price^0$ be the unknown future selling price of the house with the characteristics used in parts
(i) and (ii). Find a 95% CI for $price^0$ and comment on the width of this confidence interval.


C9 The data set NBASAL contains salary information and career statistics for 269 players in the National
Basketball Association (NBA).

(i) Estimate a model relating points-per-game (points) to years in the league (exper), age, and years
played in college (coll). Include a quadratic in exper; the other variables should appear in level
form. Report the results in the usual way.

(ii) Holding college years and age fixed, at what value of experience does the next year of
experience actually reduce points-per-game? Does this make sense?

(iii) Why do you think coll has a negative and statistically significant coefficient? (Hint: NBA players
can be drafted before finishing their college careers and even directly out of high school.)

(iv) Add a quadratic in age to the equation. Is it needed? What does this appear to imply about the
effects of age, once experience and education are controlled for?

(v) Now regress log(wage) on points, exper, $exper^2$, age, and coll. Report the results in the usual
format.

(vi) Test whether age and coll are jointly significant in the regression from part (v). What does this
imply about whether age and education have separate effects on wage, once productivity and
seniority are accounted for?


C10 Use the data in BWGHT2 for this exercise. 
(i) Estimate the equation 


$\log(bwght) = \beta_0 + \beta_1 npvis + \beta_2 npvis^2 + u$


by OLS, and report the results in the usual way. Is the quadratic term significant?

(ii) Show that, based on the equation from part (i), the number of prenatal visits that maximizes
log(bwght) is estimated to be about 22. How many women had at least 22 prenatal visits in the
sample?

(iii) Does it make sense that birth weight is actually predicted to decline after 22 prenatal visits? Explain. 

(iv) Add mother’s age to the equation, using a quadratic functional form. Holding npvis fixed, at 
what mother’s age is the birth weight of the child maximized? What fraction of women in the 
sample are older than the “optimal” age? 

(v) Would you say that mother’s age and number of prenatal visits explain a lot of the variation in 
log(bwght)? 

(vi) Using quadratics for both npvis and age, decide whether using the natural log or the level of 
bwght is better for predicting bwght. 


C11 Use APPLE to verify some of the claims made in Section 6-3.

(i) Run the regression ecolbs on ecoprc, regprc and report the results in the usual form, including 
the R-squared and adjusted R-squared. Interpret the coefficients on the price variables and 
comment on their signs and magnitudes. 

(ii) Are the price variables statistically significant? Report the p-values for the individual t tests.

(iii) What is the range of fitted values for ecolbs? What fraction of the sample reports ecolbs = 0? 
Comment. 

(iv) Do you think the price variables together do a good job of explaining variation in ecolbs? 
Explain. 

(v) Add the variables faminc, hhsize (household size), educ, and age to the regression from part (i). 
Find the p-value for their joint significance. What do you conclude? 

(vi) Run separate simple regressions of ecolbs on ecoprc and then ecolbs on regprc. How do the 
simple regression coefficients compare with the multiple regression from part (i)? Find the 
correlation coefficient between ecoprc and regprc to help explain your findings. 

(vii) Consider a model that adds family income and the quantity demanded for regular apples: 


$ecolbs = \beta_0 + \beta_1 ecoprc + \beta_2 regprc + \beta_3 faminc + \beta_4 reglbs + u$.


From basic economic theory, which explanatory variable does not belong to the equation? When 
you drop the variables one at a time, do the sizes of the adjusted R-squareds affect your answer? 


C12 Use the subset of 401KSUBS with fsize = 1; this restricts the analysis to single-person households; 
see also Computer Exercise C8 in Chapter 4. 
(i) The youngest age in the sample is 25. How many people are 25 years old? 
(ii) In the model 


$nettfa = \beta_0 + \beta_1 inc + \beta_2 age + \beta_3 age^2 + u$,


what is the literal interpretation of $\beta_2$? By itself, is it of much interest?
(iii) Estimate the model from part (ii) and report the results in standard form. Are you concerned that
the coefficient on age is negative? Explain.

(iv) Because the youngest people in the sample are 25, it makes sense to think that, for a given
level of income, the lowest average amount of net total financial assets is at age 25. Recall
that the partial effect of age on nettfa is $\beta_2 + 2\beta_3 age$, so the partial effect at age 25 is
$\beta_2 + 2\beta_3(25) = \beta_2 + 50\beta_3$; call this $\theta_2$. Find $\hat{\theta}_2$ and obtain the two-sided p-value for testing
$H_0: \theta_2 = 0$. You should conclude that $\hat{\theta}_2$ is small and very statistically insignificant. [Hint: One
way to do this is to estimate the model $nettfa = \alpha_0 + \beta_1 inc + \theta_2 age + \beta_3(age - 25)^2 + u$,
where the intercept, $\alpha_0$, is different from $\beta_0$. There are other ways, too.]

(v) Because the evidence against $H_0: \theta_2 = 0$ is very weak, set it to zero and estimate the model


$nettfa = \alpha_0 + \beta_1 inc + \beta_3(age - 25)^2 + u$.


In terms of goodness-of-fit, does this model fit better than that in part (ii)?

(vi) For the estimated equation in part (v), set inc = 30 (roughly, the average value) and graph the
relationship between nettfa and age, but only for $age \ge 25$. Describe what you see.

(vii) Check to see whether including a quadratic in inc is necessary.


C13 Use the data in MEAP00 to answer this question.

(i) Estimate the model

$math4 = \beta_0 + \beta_1 lexppp + \beta_2 lenroll + \beta_3 lunch + u$


by OLS, and report the results in the usual form. Is each explanatory variable statistically
significant at the 5% level?

(ii) Obtain the fitted values from the regression in part (i). What is the range of fitted values? How
does it compare with the range of the actual data on math4?

(iii) Obtain the residuals from the regression in part (i). What is the building code of the school that
has the largest (positive) residual? Provide an interpretation of this residual.

(iv) Add quadratics of all explanatory variables to the equation, and test them for joint significance.
Would you leave them in the model?

(v) Returning to the model in part (i), divide the dependent variable and each explanatory variable
by its sample standard deviation, and rerun the regression. (Include an intercept unless you also
first subtract the mean from each variable.) In terms of standard deviation units, which
explanatory variable has the largest effect on the math pass rate?


Use the data in BENEFITS to answer this question. It is a school-level data set at the K-5 level on
average teacher salary and benefits. See Example 4.10 for background.

(i) Regress lavgsal on bs and report the results in the usual form. Can you reject
$H_0\colon \beta_{bs} = 0$ against a two-sided alternative? Can you reject $H_0\colon \beta_{bs} = -1$ against
$H_1\colon \beta_{bs} > -1$? Report the p-values for both tests.

(ii) Define lbs = log(bs). Find the range of values for lbs and find its standard deviation.
How do these compare to the range and standard deviation for bs?

(iii) Regress lavgsal on lbs. Does this fit better than the regression from part (i)?

(iv) Estimate the equation

$$lavgsal = \beta_0 + \beta_1 bs + \beta_2 lenroll + \beta_3 lstaff + \beta_4 lunch + u$$

and report the results in the usual form. What happens to the coefficient on bs? Is it now
statistically different from zero?

(v) Interpret the coefficient on lstaff. Why do you think it is negative?

(vi) Add $lunch^2$ to the equation from part (iv). Is it statistically significant? Compute the turning
point (minimum value) in the quadratic, and show that it is within the range of the observed data
on lunch. How many values of lunch are higher than the calculated turning point?

(vii) Based on the findings from part (vi), describe how teacher salaries relate to school poverty rates.
In terms of teacher salary, and holding other factors fixed, is it better to teach at a school with
lunch = 0 (no poverty), lunch = 50, or lunch = 100 (all kids eligible for the free lunch program)?
APPENDIX 6A


6A. A Brief Introduction to Bootstrapping 


In many cases where formulas for standard errors are hard to obtain mathematically, or where they
are thought not to be very good approximations to the true sampling variation of an estimator, we
can rely on a resampling method. The general idea is to treat the observed data as a population
that we can draw samples from. The most common resampling method is the bootstrap. (There are
actually several versions of the bootstrap, but the most general, and most easily applied, is called the
nonparametric bootstrap, and that is what we describe here.)

Suppose we have an estimate, $\hat{\theta}$, of a population parameter, $\theta$. We obtained this estimate, which
could be a function of OLS estimates (or estimates that we cover in later chapters), from a random
sample of size n. We would like to obtain a standard error for $\hat{\theta}$ that can be used for constructing t
statistics or confidence intervals. Remarkably, we can obtain a valid standard error by computing the
estimate from different random samples drawn from the original data.

Implementation is easy. If we list our observations from 1 through n, we draw n numbers randomly,
with replacement, from this list. This produces a new data set (of size n) that consists of the
original data, but with many observations appearing multiple times (except in the rather unusual case
that we resample the original data). Each time we randomly sample from the original data, we can
estimate $\theta$ using the same procedure that we used on the original data. Let $\hat{\theta}^{(b)}$ denote the estimate
from bootstrap sample b. Now, if we repeat the resampling and estimation m times, we have m new
estimates, $\{\hat{\theta}^{(b)}\colon b = 1, 2, \ldots, m\}$. The bootstrap standard error of $\hat{\theta}$ is just the sample standard
deviation of the $\hat{\theta}^{(b)}$, namely,

$$\mathrm{bse}(\hat{\theta}) = \left[ (m - 1)^{-1} \sum_{b=1}^{m} \left( \hat{\theta}^{(b)} - \bar{\hat{\theta}} \right)^2 \right]^{1/2},$$ [6.50]

where $\bar{\hat{\theta}}$ is the average of the bootstrap estimates.

If obtaining an estimate of $\theta$ on a sample of size n requires little computational time, as in the
case of OLS and all the other estimators we encounter in this text, we can afford to choose m—the 
number of bootstrap replications—to be large. A typical value is m = 1,000, but even m = 500 or 
a somewhat smaller value can produce a reliable standard error. Note that the size of m—the num- 
ber of times we resample the original data—has nothing to do with the sample size, n. (For certain 
estimation problems beyond the scope of this text, a large n can force one to do fewer bootstrap 
replications.) Many statistics and econometrics packages have built-in bootstrap commands, and this 
makes the calculation of bootstrap standard errors simple, especially compared with the work often 
required to obtain an analytical formula for an asymptotic standard error. 
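
The resampling recipe above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `bootstrap_se`, the simulated sample, and the seeds are hypothetical and do not come from the text. The estimator is the sample mean, chosen because its analytical standard error is known and can serve as a check.

```python
import numpy as np

def bootstrap_se(data, estimator, m=1000, seed=0):
    """Nonparametric bootstrap standard error, following equation (6.50).

    data: 1-D array of n observations (hypothetical input)
    estimator: function mapping a sample to a scalar estimate
    m: number of bootstrap replications
    """
    rng = np.random.default_rng(seed)
    n = len(data)
    estimates = np.empty(m)
    for b in range(m):
        # Draw n observation indices randomly, with replacement.
        idx = rng.integers(0, n, size=n)
        estimates[b] = estimator(data[idx])
    # Sample standard deviation of the m estimates (divisor m - 1).
    return estimates.std(ddof=1)

# Simulated data, used only for illustration.
rng = np.random.default_rng(42)
sample = rng.normal(loc=5.0, scale=2.0, size=200)

bse_mean = bootstrap_se(sample, np.mean, m=1000, seed=1)
analytic_se = sample.std(ddof=1) / np.sqrt(len(sample))
print(f"bootstrap se: {bse_mean:.4f}, analytical se: {analytic_se:.4f}")
```

For the sample mean the two numbers should be close; the bootstrap becomes more useful when the estimator is a nonlinear function of OLS estimates and no simple formula is at hand.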

One can actually do better in most cases by using the bootstrap sample to compute p-values for 
t statistics (and F statistics), or for obtaining confidence intervals, rather than obtaining a bootstrap 
standard error to be used in the construction of t statistics or confidence intervals. See Horowitz 
(2001) for a comprehensive treatment. 
Almost all of our discussion in the previous chapters has focused on the case where the dependent
and independent variables in our multiple regression models have quantitative meaning.
Just a few examples include hourly wage rate, years of education, college grade point average,
amount of air pollution, level of firm sales, and number of arrests. In each case, the magnitude of the
variable conveys useful information. In some cases, we take the natural log and then the coefficients
can be turned into percentage changes.

In Section 2-7 we introduced the notion of a binary (or dummy) explanatory variable, and we dis- 
cussed how simple regression on a binary variable can be used to evaluate randomized interventions. 
We showed how to extend program evaluation to the multiple regression case in Sections 3-7e and 4-7 
when it is necessary to account for observed differences between the control and treatment groups. 

The purpose of this chapter is to provide a comprehensive analysis of how to include qualita- 
tive factors into regression models. In addition to indicators of participating in a program, or being 
subjected to a new policy, the race or ethnicity of an individual, marital status, the industry of a firm 
(manufacturing, retail, and so on), and the region in the United States where a city is located (South, 
North, West, and so on) are common examples of qualitative factors. 

After we discuss the appropriate ways to describe qualitative information in Section 7-1, we show
how qualitative explanatory variables can be easily incorporated into multiple regression models
in Sections 7-2, 7-3, and 7-4. These sections cover almost all of the popular ways that qualitative
independent variables are used in cross-sectional regression analysis, including creating interactions
among qualitative variables and between qualitative and quantitative variables.

In Section 7-5, we discuss the case where our dependent variable is binary, which is a particular 
kind of qualitative dependent variable. The multiple regression model is called the linear probability 
model (LPM), and the coefficients can be interpreted as changes in a probability. While much maligned 
by some econometricians, the simplicity of the LPM makes it useful in many empirical contexts. We 
will describe drawbacks of the LPM in Section 7-5, but they are often secondary in empirical work. 

Section 7-6 reconsiders policy analysis, including the potential outcomes perspective, and proposes
a flexible regression approach for estimating the effects of interventions. Section 7-7 is a short
section that explains how to interpret multiple regression estimates when y is a discrete variable that
has quantitative meaning.

This chapter does not assume you have read the material on potential outcomes and policy analysis
in Chapters 2, 3, and 4, and so it stands alone as a discussion of how to incorporate qualitative
information into regression.


7-1 Describing Qualitative Information 


Qualitative factors often come in the form of binary information: a person is female or male; a person 
does or does not own a personal computer; a firm offers a certain kind of employee pension plan or it 
does not; a state administers capital punishment or it does not. In all of these examples, the relevant infor- 
mation can be captured by defining a binary variable or a zero-one variable. In econometrics, binary 
variables are most commonly called dummy variables, although this name is not especially descriptive. 

In defining a dummy variable, we must decide 
which event is assigned the value one and which is 
assigned the value zero. For example, in a study of indi- 
vidual wage determination, we might define female to 
be a binary variable taking on the value one for females 
and the value zero for males. The name in this case 
indicates the event with the value one. The same infor- 
mation is captured by defining male to be one if the 
person is male and zero if the person is female. Either 
of these is better than using gender because this name 
does not make it clear when the dummy variable is one: does gender = 1 correspond to male or female? 
What we call our variables is unimportant for getting regression results, but it always helps to choose 
names that clarify equations and expositions. 

Suppose in the wage example that we have chosen the name female to indicate gender. Further, 
we define a binary variable married to equal one if a person is married and zero if otherwise. 
Table 7.1 gives a partial listing of a wage data set that might result. We see that Person 1 is female and 
not married, Person 2 is female and married, Person 3 is male and not married, and so on. 

Why do we use the values zero and one to describe qualitative information? In a sense, these 
values are arbitrary: any two different values would do. The real benefit of capturing qualitative infor- 
mation using zero-one variables is that it leads to regression models where the parameters have very 
natural interpretations, as we will see now. 


GOING FURTHER 7.1 


Suppose that, in a study comparing elec- 
tion outcomes between Democratic and 
Republican candidates, you wish to indicate 
the party of each candidate. Is a name such 
as party a wise choice for a binary variable 
in this case? What would be a better name? 
TABLE 7.1 A Partial Listing of the Data in WAGE1

person   wage    educ   exper   female   married
1        3.10    11     2       1        0
2        3.24    12     22      1        1
3        3.00    11     2       0        0
4        6.00    8      44      0        1
5        5.30    12     7       0        1
...      ...     ...    ...     ...      ...
525      11.56   16     5       0        1
526      3.50    14     5       1        0


7-2 A Single Dummy Independent Variable


How do we incorporate binary information into regression models? In the simplest case, with only 
a single dummy explanatory variable, we just add it as an independent variable in the equation. For 
example, consider the following simple model of hourly wage determination: 


$$wage = \beta_0 + \delta_0 female + \beta_1 educ + u.$$ [7.1]


We use $\delta_0$ as the parameter on female in order to highlight the interpretation of the parameters
multiplying dummy variables; later, we will use whatever notation is most convenient.

In model (7.1), only two observed factors affect wage: gender and education. Because female = 1
when the person is female, and female = 0 when the person is male, the parameter $\delta_0$ has the following
interpretation: $\delta_0$ is the difference in hourly wage between females and males, given the same
amount of education (and the same error term u). Thus, the coefficient $\delta_0$ determines whether there
is discrimination against women: if $\delta_0 < 0$, then for the same level of other factors, women earn less
than men on average.

In terms of expectations, if we assume the zero conditional mean assumption
$E(u \mid female, educ) = 0$, then

$$\delta_0 = E(wage \mid female = 1, educ) - E(wage \mid female = 0, educ).$$

Because female = 1 corresponds to females and female = 0 corresponds to males, we can write this
more simply as

$$\delta_0 = E(wage \mid female, educ) - E(wage \mid male, educ).$$ [7.2]

The key here is that the level of education is the same in both expectations; the difference, $\delta_0$, is due
to gender only.

The situation can be depicted graphically as an intercept shift between males and females. In
Figure 7.1, the case $\delta_0 < 0$ is shown, so that men earn a fixed amount more per hour than women. The
difference does not depend on the amount of education, and this explains why the wage-education
profiles for women and men are parallel.

FIGURE 7.1 Graph of $wage = \beta_0 + \delta_0 female + \beta_1 educ$ for $\delta_0 < 0$.
[Figure: wage on the vertical axis against educ on the horizontal axis. Two parallel lines are drawn: men, $wage = \beta_0 + \beta_1 educ$, with intercept $\beta_0$; and women, $wage = (\beta_0 + \delta_0) + \beta_1 educ$, with the lower intercept $\beta_0 + \delta_0$.]

At this point, you may wonder why we do not also include in (7.1) a dummy variable, say male,
which is one for males and zero for females. This would be redundant. In (7.1), the intercept for males
is $\beta_0$, and the intercept for females is $\beta_0 + \delta_0$. Because there are just two groups, we only need two
different intercepts. This means that, in addition to $\beta_0$, we need to use only one dummy variable; we
have chosen to include the dummy variable for females. Using two dummy variables would introduce
perfect collinearity because female + male = 1, which means that male is a perfect linear function of
female. Including dummy variables for both genders is the simplest example of the so-called dummy
variable trap, which arises when too many dummy variables describe a given number of groups. We
will discuss this problem in detail later.
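
The perfect collinearity behind the dummy variable trap can be seen by checking the rank of the regressor matrix. The sketch below uses a tiny made-up data set (the six rows are hypothetical, not taken from WAGE1) and compares a design that includes an intercept, female, and male with one that drops male.

```python
import numpy as np

# Hypothetical toy data: six people, for illustration only.
female = np.array([1, 1, 0, 0, 0, 1])
male = 1 - female                      # female + male = 1 for everyone
educ = np.array([11, 12, 11, 8, 12, 14])
const = np.ones_like(female)

# Intercept plus both dummies: four columns, but one is redundant.
X_trap = np.column_stack([const, female, male, educ])
# Intercept plus one dummy: males are the base group.
X_ok = np.column_stack([const, female, educ])

rank_trap = np.linalg.matrix_rank(X_trap)
rank_ok = np.linalg.matrix_rank(X_ok)
print(f"trap: {X_trap.shape[1]} columns, rank {rank_trap}")
print(f"ok:   {X_ok.shape[1]} columns, rank {rank_ok}")
```

Because the rank of the first matrix is smaller than its number of columns, the OLS coefficients are not uniquely determined; dropping one dummy restores full column rank.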

In (7.1), we have chosen males to be the base group or benchmark group, that is, the group
against which comparisons are made. This is why $\beta_0$ is the intercept for males, and $\delta_0$ is the difference
in intercepts between females and males. We could choose females as the base group by writing the
model as

$$wage = \alpha_0 + \gamma_0 male + \beta_1 educ + u,$$

where the intercept for females is $\alpha_0$ and the intercept for males is $\alpha_0 + \gamma_0$; this implies that
$\alpha_0 = \beta_0 + \delta_0$ and $\alpha_0 + \gamma_0 = \beta_0$. In any application, it does not matter how we choose the base
group, but it is important to keep track of which group is the base group.

Some researchers prefer to drop the overall intercept in the model and to include dummy variables
for each group. The equation would then be $wage = \beta_0 male + \alpha_0 female + \beta_1 educ + u$, where
the intercept for men is $\beta_0$ and the intercept for women is $\alpha_0$. There is no dummy variable trap in this
case because we do not have an overall intercept. However, this formulation has little to offer, because
testing for a difference in the intercepts is more difficult, and there is no generally agreed upon way to
compute R-squared in regressions without an intercept. Therefore, we will always include an overall
intercept for the base group.

Nothing much changes when more explanatory variables are involved. Taking males as the base
group, a model that controls for experience and tenure in addition to education is

$$wage = \beta_0 + \delta_0 female + \beta_1 educ + \beta_2 exper + \beta_3 tenure + u.$$ [7.3]

If educ, exper, and tenure are all relevant productivity characteristics, the null hypothesis of no
difference between men and women is $H_0\colon \delta_0 = 0$. The alternative that there is discrimination against
women is $H_1\colon \delta_0 < 0$.

How can we actually test for wage discrimination? The answer is simple: just estimate the model 
by OLS, exactly as before, and use the usual t statistic. Nothing changes about the mechanics of OLS
or the statistical theory when some of the independent variables are defined as dummy variables. The 
only difference with what we have done up until now is in the interpretation of the coefficient on the 
dummy variable. 


EXAMPLE 7.1 Hourly Wage Equation


Using the data in WAGE1, we estimate model (7.3). For now, we use wage, rather than log(wage), as
the dependent variable:

$$\widehat{wage} = \underset{(.72)}{-1.57} - \underset{(.26)}{1.81}\,female + \underset{(.049)}{.572}\,educ + \underset{(.012)}{.025}\,exper + \underset{(.021)}{.141}\,tenure$$ [7.4]
$$n = 526,\ R^2 = .364.$$

The negative intercept—the intercept for men, in this case—is not very meaningful because no one 
has zero values for all of educ, exper, and tenure in the sample. The coefficient on female is interest- 
ing because it measures the average difference in hourly wage between a man and a woman who have 
the same levels of educ, exper, and tenure. If we take a woman and a man with the same levels of 
education, experience, and tenure, the woman earns, on average, $1.81 less per hour than the man. 
(Recall that these are 1976 wages.) 

It is important to remember that, because we have performed multiple regression and controlled 
for educ, exper, and tenure, the $1.81 wage differential cannot be explained by different average lev- 
els of education, experience, or tenure between men and women. We can conclude that the differen- 
tial of $1.81 is due to gender or factors associated with gender that we have not controlled for in the 
regression. [In 2013 dollars, the wage differential is about 4.09(1.81) ≈ 7.40.]

It is informative to compare the coefficient on female in equation (7.4) to the estimate we get 
when all other explanatory variables are dropped from the equation: 


$$\widehat{wage} = \underset{(.21)}{7.10} - \underset{(.30)}{2.51}\,female$$ [7.5]
$$n = 526,\ R^2 = .116.$$


As discussed in Section 2-7, the coefficients in (7.5) have a simple interpretation. The intercept is the
average wage for men in the sample (let female = 0), so men earn $7.10 per hour on average. The 
coefficient on female is the difference in the average wage between women and men. Thus, the aver- 
age wage for women in the sample is 7.10 — 2.51 = 4.59, or $4.59 per hour. (Incidentally, there are 
274 men and 252 women in the sample.) 

Equation (7.5) provides a simple way to carry out a comparison-of-means test between the two
groups, which in this case are men and women. The estimated difference, −2.51, has a t statistic of
−8.37, which is very statistically significant (and, of course, $2.51 is economically large as well).
Generally, simple regression on a constant and a dummy variable is a straightforward way to compare
the means of two groups. For the usual t test to be valid, we must assume that the homoskedasticity
assumption holds, which means that the population variance in wages for men is the same as that for
women.
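
The link between regression on a dummy variable and a comparison of means can be checked numerically. The sketch below uses simulated data (the coefficients 7.0 and −2.5, the error scale, and the seed are assumptions chosen for illustration; this is not the WAGE1 sample). It confirms that the OLS intercept equals the mean for the base group, that the slope equals the difference in group means, and that the usual OLS t statistic matches the pooled-variance two-sample t statistic.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
female = rng.integers(0, 2, size=n)                 # simulated dummy
wage = 7.0 - 2.5 * female + rng.normal(scale=3.0, size=n)

# OLS of wage on a constant and the dummy.
X = np.column_stack([np.ones(n), female])
beta, *_ = np.linalg.lstsq(X, wage, rcond=None)

mean_men = wage[female == 0].mean()
mean_women = wage[female == 1].mean()

# Usual (homoskedastic) OLS standard error and t statistic on female.
resid = wage - X @ beta
sigma2 = resid @ resid / (n - 2)
se_ols = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[1, 1])
t_ols = beta[1] / se_ols

# Pooled-variance two-sample t statistic for the difference in means.
n0, n1 = (female == 0).sum(), (female == 1).sum()
s0 = wage[female == 0].var(ddof=1)
s1 = wage[female == 1].var(ddof=1)
pooled = ((n0 - 1) * s0 + (n1 - 1) * s1) / (n - 2)
t_pooled = (mean_women - mean_men) / np.sqrt(pooled * (1 / n0 + 1 / n1))

print(beta, mean_men, mean_women - mean_men, t_ols, t_pooled)
```

The agreement of the two t statistics is exact, which is why the homoskedasticity assumption here amounts to assuming equal variances in the two groups.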

The estimated wage differential between men and women is larger in (7.5) than in (7.4) because 
(7.5) does not control for differences in education, experience, and tenure, and these are lower, on 
average, for women than for men in this sample. Equation (7.4) gives a more reliable estimate of the 
ceteris paribus gender wage gap; it still indicates a very large differential. 
In many cases, dummy independent variables reflect choices of individuals or other economic
units (as opposed to something predetermined, such as gender). In such situations, the matter of cau- 
sality is again a central issue. In the following example, we would like to know whether personal 
computer ownership causes a higher college grade point average. 


EXAMPLE 7.2 Effects of Computer Ownership on College GPA


In order to determine the effects of computer ownership on college grade point average, we estimate
the model

$$colGPA = \beta_0 + \delta_0 PC + \beta_1 hsGPA + \beta_2 ACT + u,$$


where the dummy variable PC equals one if a student owns a personal computer and zero otherwise. 
There are various reasons PC ownership might have an effect on colGPA. A student’s schoolwork 
might be of higher quality if it is done on a computer, and time can be saved by not having to wait at a 
computer lab. Of course, a student might be more inclined to play computer games or surf the Internet 
if he or she owns a PC, so it is not obvious that $\delta_0$ is positive. The variables hsGPA (high school GPA)
and ACT (achievement test score) are used as controls: it could be that stronger students, as measured 
by high school GPA and ACT scores, are more likely to own computers. We control for these factors 
because we would like to know the average effect on colGPA if a student is picked at random and 
given a personal computer. 
Using the data in GPA1, we obtain 


$$\widehat{colGPA} = \underset{(.33)}{1.26} + \underset{(.057)}{.157}\,PC + \underset{(.094)}{.447}\,hsGPA + \underset{(.0105)}{.0087}\,ACT$$ [7.6]
$$n = 141,\ R^2 = .219.$$


This equation implies that a student who owns a PC has a predicted GPA about .16 points higher than 
a comparable student without a PC (remember, both colGPA and hsGPA are on a four-point scale). 
The effect is also very statistically significant, with $t_{PC} = .157/.057 \approx 2.75$.

What happens if we drop hsGPA and ACT from the equation? Clearly, dropping the latter variable
should have very little effect, as its coefficient and t statistic are very small. But hsGPA is very
significant, and so dropping it could affect the estimate of $\beta_{PC}$. Regressing colGPA on PC gives an
estimate on PC equal to about .170, with a standard error of .063; in this case, $\hat{\beta}_{PC}$ and its t statistic do
not change by much.

In the exercises at the end of the chapter, you will be asked to control for other factors in the 
equation to see if the computer ownership effect disappears, or if it at least gets notably smaller. 


Each of the previous examples can be viewed as having relevance for policy analysis. In the 
first example, we were interested in gender discrimination in the workforce. In the second example, 
we were concerned with the effect of computer ownership on college performance. A special case of 
policy analysis is program evaluation, where we would like to know the effect of economic or social 
programs on individuals, firms, neighborhoods, cities, and so on. 

In the simplest case, there are two groups of subjects. The control group does not participate 
in the program. The experimental group or treatment group does take part in the program. These 
names come from literature in the experimental sciences, and they should not be taken literally. 
Except in rare cases, the choice of the control and treatment groups is not random. However, in some 
cases, multiple regression analysis can be used to control for enough other factors in order to estimate 
the causal effect of the program. 
EXAMPLE 7.3 Effects of Training Grants on Hours of Training


Using the 1988 data for Michigan manufacturing firms in JTRAIN, we obtain the following estimated
equation:

$$\widehat{hrsemp} = \underset{(43.41)}{46.67} + \underset{(5.59)}{26.25}\,grant - \underset{(3.54)}{.98}\,\log(sales) - \underset{(3.88)}{6.07}\,\log(employ)$$ [7.7]
$$n = 105,\ R^2 = .237.$$


The dependent variable is hours of training per employee, at the firm level. The variable grant is a 
dummy variable equal to one if the firm received a job training grant for 1988, and zero otherwise. 
The variables sales and employ represent annual sales and number of employees, respectively. We 
cannot enter hrsemp in logarithmic form because hrsemp is zero for 29 of the 105 firms used in the 
regression. 

The variable grant is very statistically significant, with $t_{grant} = 4.70$. Controlling for sales and
employment, firms that received a grant trained each worker, on average, 26.25 hours more. Because 
the average number of hours of per worker training in the sample is about 17, with a maximum value 
of 164, grant has a large effect on training, as is expected. 

The coefficient on log(sales) is small and very insignificant. The coefficient on log(employ) 
means that, if a firm is 10% larger, it trains its workers about .61 hour less. Its t statistic is −1.56,
which is only marginally statistically significant. 


As with any other independent variable, we should ask whether the measured effect of a qualitative 
variable is causal. In equation (7.7), is the difference in training between firms that receive grants and 
those that do not due to the grant, or is grant receipt simply an indicator of something else? It might be 
that the firms receiving grants would have, on average, trained their workers more even in the absence 
of a grant. Nothing in this analysis tells us whether we have estimated a causal effect; we must know 
how the firms receiving grants were determined. We can only hope we have controlled for as many 
factors as possible that might be related to whether a firm received a grant and to its levels of training. 

In Section 7-6 we return to policy analysis using binary indicators, including obtaining a more flexible
framework in the context of potential outcomes. These themes reappear in the remainder of the text.


7-2a Interpreting Coefficients on Dummy Explanatory 
Variables When the Dependent Variable Is log(y) 


A common specification in applied work has the dependent variable appearing in logarithmic form, 
with one or more dummy variables appearing as independent variables. How do we interpret the dummy 
variable coefficients in this case? Not surprisingly, the coefficients have a percentage interpretation. 


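EXAMPLE 7.4 Housing Price Regression


Using the data in HPRICE1, we obtain the equation

$$\begin{aligned}
\widehat{\log(price)} = {} & \underset{(.65)}{-1.35} + \underset{(.038)}{.168}\,\log(lotsize) + \underset{(.093)}{.707}\,\log(sqrft) \\
& + \underset{(.029)}{.027}\,bdrms + \underset{(.045)}{.054}\,colonial
\end{aligned}$$ [7.8]
$$n = 88,\ R^2 = .649.$$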
All the variables are self-explanatory except colonial, which is a binary variable equal to one if the
house is of the colonial style. What does the coefficient on colonial mean? For given levels of lotsize, 
sqrft, and bdrms, the difference in log(price) between a house of colonial style and that of another 
style is .054. This means that a colonial-style house is predicted to sell for about 5.4% more, holding 
other factors fixed. 


This example shows that, when log(y) is the dependent variable in a model, the coefficient on a 
dummy variable, when multiplied by 100, is interpreted as the percentage difference in y, holding all 
other factors fixed. When the coefficient on a dummy variable suggests a large proportionate change 
in y, the exact percentage difference can be obtained exactly as with the semi-elasticity calculation in 
Section 6-2. 


EXAMPLE 7.5 Log Hourly Wage Equation


Let us reestimate the wage equation from Example 7.1, using log(wage) as the dependent variable and
adding quadratics in exper and tenure:

$$\begin{aligned}
\widehat{\log(wage)} = {} & \underset{(.099)}{.417} - \underset{(.036)}{.297}\,female + \underset{(.007)}{.080}\,educ + \underset{(.005)}{.029}\,exper \\
& - \underset{(.00010)}{.00058}\,exper^2 + \underset{(.007)}{.032}\,tenure - \underset{(.00023)}{.00059}\,tenure^2
\end{aligned}$$ [7.9]
$$n = 526,\ R^2 = .441.$$


Using the same approximation as in Example 7.4, the coefficient on female implies that, for the
same levels of educ, exper, and tenure, women earn about 100(.297) = 29.7% less than men. We
can do better than this by computing the exact percentage difference in predicted wages. What we
want is the proportionate difference in wages between females and males, holding other factors fixed:
$(\widehat{wage}_F - \widehat{wage}_M)/\widehat{wage}_M$. What we have from (7.9) is

$$\widehat{\log(wage_F)} - \widehat{\log(wage_M)} = -.297.$$

Exponentiating and subtracting one gives

$$(\widehat{wage}_F - \widehat{wage}_M)/\widehat{wage}_M = \exp(-.297) - 1 \approx -.257.$$

This more accurate estimate implies that a woman’s wage is, on average, 25.7% below a comparable
man’s wage.


If we had made the same correction in Example 7.4, we would have obtained $\exp(.054) - 1 \approx .0555$,
or about 5.6%. The correction has a smaller effect in Example 7.4 than in the wage example
because the magnitude of the coefficient on the dummy variable is much smaller in (7.8) than in (7.9).

Generally, if $\hat{\beta}_1$ is the coefficient on a dummy variable, say $x_1$, when log(y) is the dependent
variable, the exact percentage difference in the predicted y when $x_1 = 1$ versus when $x_1 = 0$ is

$$100\cdot[\exp(\hat{\beta}_1) - 1].$$ [7.10]

The estimate $\hat{\beta}_1$ can be positive or negative, and it is important to preserve its sign in computing (7.10).

The logarithmic approximation has the advantage of providing an estimate between the magnitudes
obtained by using each group as the base group. In particular, although equation (7.10)
gives us a better estimate than $100\cdot\hat{\beta}_1$ of the percentage by which y for $x_1 = 1$ is greater than y for
$x_1 = 0$, (7.10) is not a good estimate if we switch the base group. In Example 7.5, we can estimate
the percentage by which a man’s wage exceeds a comparable woman’s wage, and this estimate is
$100\cdot[\exp(-\hat{\beta}_1) - 1] = 100\cdot[\exp(.297) - 1] \approx 34.6$. The approximation, based on $100\cdot|\hat{\beta}_1|$, 29.7, is
between 25.7 and 34.6 (and close to the middle). Therefore, it makes sense to report that “the difference
in predicted wages between men and women is about 29.7%,” without having to take a stand on
which is the base group.
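
The arithmetic behind (7.10) is easy to reproduce. The short sketch below applies the formula to the coefficients reported in (7.8) and (7.9); the helper name `exact_pct_diff` is a hypothetical label introduced only for this illustration.

```python
import math

def exact_pct_diff(beta_hat):
    """Exact percentage difference in predicted y implied by a dummy
    coefficient when log(y) is the dependent variable, as in (7.10)."""
    return 100.0 * (math.exp(beta_hat) - 1.0)

beta_female = -0.297     # coefficient on female in (7.9)
beta_colonial = 0.054    # coefficient on colonial in (7.8)

approx = 100.0 * abs(beta_female)             # log approximation: 29.7
women_vs_men = exact_pct_diff(beta_female)    # men as the base group
men_vs_women = exact_pct_diff(-beta_female)   # women as the base group
colonial = exact_pct_diff(beta_colonial)

print(f"women relative to men: {women_vs_men:.1f}%")
print(f"men relative to women: {men_vs_women:.1f}%")
print(f"log approximation:     {approx:.1f}%")
print(f"colonial premium:      {colonial:.2f}%")
```

The output shows the approximation of 29.7 sitting between the two exact magnitudes, while for the small colonial coefficient the exact and approximate figures nearly coincide.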


7-3 Using Dummy Variables for Multiple Categories 


We can use several dummy independent variables in the same equation. For example, we could add 
the dummy variable married to equation (7.9). The coefficient on married gives the (approximate) 
proportional differential in wages between those who are and are not married, holding female, educ, 
exper, and tenure fixed. When we estimate this model, the coefficient on married (with standard error 
in parentheses) is .053 (.041), and the coefficient on female becomes —.290 (.036). Thus, the “mar- 
riage premium” is estimated to be about 5.3%, but it is not statistically different from zero (t = 1.29). 
An important limitation of this model is that the marriage premium is assumed to be the same for men 
and women; this is relaxed in the following example. 


EXAMPLE 7.6 Log Hourly Wage Equation


Let us estimate a model that allows for wage differences among four groups: married men, married
women, single men, and single women. To do this, we must select a base group; we choose single
men. Then, we must define dummy variables for each of the remaining groups. Call these marrmale,
marrfem, and singfem. Putting these three variables into (7.9) (and, of course, dropping female, as it
is now redundant) gives

$$\begin{aligned}
\widehat{\log(wage)} = {} & \underset{(.100)}{.321} + \underset{(.055)}{.213}\,marrmale - \underset{(.058)}{.198}\,marrfem \\
& - \underset{(.056)}{.110}\,singfem + \underset{(.007)}{.079}\,educ + \underset{(.005)}{.027}\,exper - \underset{(.00011)}{.00054}\,exper^2 \\
& + \underset{(.007)}{.029}\,tenure - \underset{(.00023)}{.00053}\,tenure^2
\end{aligned}$$ [7.11]
$$n = 526,\ R^2 = .461.$$


All of the coefficients, with the exception of singfem, have t statistics well above two in absolute 
value. The t statistic for singfem is about −1.96, which is just significant at the 5% level against a
two-sided alternative. 

To interpret the coefficients on the dummy variables, we must remember that the base group is 
single males. Thus, the estimates on the three dummy variables measure the proportionate difference 
in wage relative to single males. For example, married men are estimated to earn about 21.3% more 
than single men, holding levels of education, experience, and tenure fixed. [The more precise estimate 
from (7.10) is about 23.7%.] A married woman, on the other hand, earns a predicted 19.8% less than 
a single man with the same levels of the other variables. 

Because the base group is represented by the intercept in (7.11), we have included dummy vari- 
ables for only three of the four groups. If we were to add a dummy variable for single males to (7.11), 
we would fall into the dummy variable trap by introducing perfect collinearity. Some regression pack- 
ages will automatically correct this mistake for you, while others will just tell you there is perfect 
collinearity. It is best to carefully specify the dummy variables because then we are forced to properly 
interpret the final model. 
Even though single men is the base group in (7.11), we can use this equation to obtain the esti-
mated difference between any two groups. Because the overall intercept is common to all groups, we 
can ignore that in finding differences. Thus, the estimated proportionate difference between single 
and married women is —.110 — (—.198) = .088, which means that single women earn about 8.8% 
more than married women. Unfortunately, we cannot use equation (7.11) for testing whether the esti- 
mated difference between single and married women is statistically significant. Knowing the standard 
errors on marrfem and singfem is not enough to carry out the test (see Section 4-4). The easiest thing 
to do is to choose one of these groups to be the base group and to reestimate the equation. Nothing 
substantive changes, but we get the needed estimate and its standard error directly. When we use mar- 
ried women as the base group, we obtain 


$$\widehat{\log(wage)} = \underset{(.106)}{.123} + \underset{(.056)}{.411}\,\mathit{marrmale} + \underset{(.058)}{.198}\,\mathit{singmale} + \underset{(.052)}{.088}\,\mathit{singfem} + \cdots,$$


where, of course, none of the unreported coefficients or standard errors have changed. The estimate
on singfem is, as expected, .088. Now, we have a standard error to go along with this estimate. The t
statistic for the null that there is no difference in the population between married and single women
is $t_{singfem} = .088/.052 \approx 1.69$. This is marginal evidence against the null hypothesis. We also see that
the estimated difference between married men and married women is very statistically significant
($t_{marrmale} = 7.34$).
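The re-basing arithmetic can be sketched in a few lines of Python. This is only an illustration: it takes the intercept (.321) and the group coefficients reported for (7.11), treats single men as having a coefficient of zero, and subtracts the coefficient of the new base group from every group. The function name `rebase` is hypothetical. The sketch recovers point estimates only; the standard errors still require reestimating the equation.

```python
# Intercept and group-dummy coefficients reported for (7.11).
# Single men are the base group, so their "coefficient" is zero by construction.
intercept = 0.321
coef = {"singmale": 0.0, "marrmale": 0.213, "marrfem": -0.198, "singfem": -0.110}

def rebase(intercept, coef, new_base):
    """Re-express the group coefficients relative to a different base group."""
    shift = coef[new_base]
    new_intercept = round(intercept + shift, 3)
    new_coef = {g: round(b - shift, 3) for g, b in coef.items()}
    return new_intercept, new_coef

new_intercept, new_coef = rebase(intercept, coef, "marrfem")
print(new_intercept, new_coef)
```

With married women as the new base group, the shifted values agree with the intercept and coefficients in the reestimated equation, up to rounding.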


GOING FURTHER 7.2
In the baseball salary data found in MLB1, players are given one of six positions: frstbase,
scndbase, thrdbase, shrtstop, outfield, or catcher. To allow for salary differentials across position,
with outfielders as the base group, which dummy variables would you include as independent
variables?

The previous example illustrates a general principle for including dummy variables to indicate different
groups: if the regression model is to have different intercepts for, say, g groups or categories, we
need to include g − 1 dummy variables in the model along with an intercept. The intercept for the base
group is the overall intercept in the model, and the dummy variable coefficient for a particular group
represents the estimated difference in intercepts between that group and the base group. Including g
dummy variables along with an intercept will result in the dummy variable trap. An alternative is to
include g dummy variables and to exclude an overall intercept. Including g dummies without an overall
intercept is sometimes useful, but it has two practical drawbacks. First, it makes it more cumbersome
to test for differences relative to a base group. Second, regression packages usually change the way
R-squared is computed when an overall intercept is not included. In particular, in the
formula $R^2 = 1 - SSR/SST$, the total sum of squares, SST, is replaced with a total sum of squares that
does not center $y_i$ about its mean, say, $SST_0 = \sum_{i=1}^{n} y_i^2$. The resulting R-squared, say
$R_0^2 = 1 - SSR/SST_0$, is sometimes called the uncentered R-squared. Unfortunately, $R_0^2$ is rarely
suitable as a goodness-of-fit measure. It is always true that $SST_0 \geq SST$, with equality only if
$\bar{y} = 0$. Often, $SST_0$ is much larger than SST, which means that $R_0^2$ is much larger than $R^2$.
For example, if in the previous example we regress log(wage) on marrmale, singmale, marrfem, singfem,
and the other explanatory variables, without an intercept, the reported R-squared from Stata, which is
$R_0^2$, is .948. This high R-squared is an artifact of not centering the total sum of squares in the
calculation. The correct R-squared is given in equation (7.11) as .461. Some regression packages,
including Stata, have an option to force calculation of the centered R-squared even though an overall
intercept has not been included, and using this option is generally a good idea. In the vast majority
of cases, any R-squared based on comparing an SSR and SST should have SST computed by centering the
$y_i$ about $\bar{y}$. We can think of this SST as the sum of squared residuals obtained if we just use
the sample average, $\bar{y}$, to predict each $y_i$. Surely we are setting the bar pretty low for any
model if all we measure is its fit relative to using a constant predictor. For a model without an
intercept that fits poorly, it is possible that SSR > SST, which means $R^2$ would be negative. The
uncentered R-squared will always be between zero and one, which likely explains why it is usually the
default when an intercept is not estimated in regression models.
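The gap between the centered and uncentered R-squared is easy to see numerically. The following sketch uses made-up data for two groups (the values and group labels are assumptions, not taken from any data set in the text). It relies on the fact that regressing y on a full set of group dummies without an intercept produces the group means as fitted values, so no matrix algebra is needed.

```python
# Hypothetical log-wage data for two groups; values are made up for illustration.
y = [1.2, 1.5, 1.4, 1.9, 2.1, 2.0]
group = ["a", "a", "a", "b", "b", "b"]

# With a full set of group dummies and no intercept, OLS fitted values are group means.
means = {}
for g in set(group):
    vals = [yi for yi, gi in zip(y, group) if gi == g]
    means[g] = sum(vals) / len(vals)
fitted = [means[g] for g in group]

ssr = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
ybar = sum(y) / len(y)
sst = sum((yi - ybar) ** 2 for yi in y)   # centered total sum of squares
sst0 = sum(yi ** 2 for yi in y)           # uncentered total sum of squares

r2 = 1 - ssr / sst      # centered R-squared
r2_0 = 1 - ssr / sst0   # uncentered R-squared
print(round(r2, 3), round(r2_0, 3))
```

Because the sample mean of y is far from zero relative to its spread, the uncentered total sum of squares is much larger than the centered one, and the uncentered R-squared is pushed toward one.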


7-3a Incorporating Ordinal Information by Using Dummy Variables 


Suppose that we would like to estimate the effect of city credit ratings on the municipal bond interest 
rate (MBR). Several financial companies, such as Moody’s Investors Service and Standard and Poor’s, 
rate the quality of debt for local governments, where the ratings depend on things like probability of 
default. (Local governments prefer lower interest rates in order to reduce their costs of borrowing.) 
For simplicity, suppose that rankings take on the integer values {0, 1, 2, 3, 4}, with zero being the 
worst credit rating and four being the best. This is an example of an ordinal variable. Call this vari- 
able CR for concreteness. The question we need to address is: How do we incorporate the variable CR 
into a model to explain MBR? 
One possibility is to just include CR as we would include any other explanatory variable: 


$$MBR = \beta_0 + \beta_1 CR + \text{other factors},$$


where we do not explicitly show what other factors are in the model. Then $\beta_1$ is the percentage point
change in MBR when CR increases by one unit, holding other factors fixed. Unfortunately, it is rather 
hard to interpret a one-unit increase in CR. We know the quantitative meaning of another year of edu- 
cation, or another dollar spent per student, but things like credit ratings typically have only ordinal 
meaning. We know that a CR of four is better than a CR of three, but is the difference between four 
and three the same as the difference between one and zero? If not, then it might not make sense to 
assume that a one-unit increase in CR has a constant effect on MBR. 

A better approach, which we can implement because CR takes on relatively few values, is to
define dummy variables for each value of CR. Thus, let $CR_1 = 1$ if CR = 1, and $CR_1 = 0$ otherwise;
$CR_2 = 1$ if CR = 2, and $CR_2 = 0$ otherwise; and so on. Effectively, we take the single credit rating
and turn it into five categories. Then, we can estimate the model


$$MBR = \beta_0 + \delta_1 CR_1 + \delta_2 CR_2 + \delta_3 CR_3 + \delta_4 CR_4 + \text{other factors}. \qquad [7.12]$$


Following our rule for including dummy variables in a model, we include four dummy variables
because we have five categories. The omitted category here is a credit rating of zero, and so it is the
base group. (This is why we do not need to define a dummy variable for this category.) The coefficients
are easy to interpret: $\delta_1$ is the difference in MBR (other factors fixed) between a municipality
with a credit rating of one and a municipality with a credit rating of zero; $\delta_2$ is the difference
in MBR between a municipality with a credit rating of two and a municipality with a credit rating of
zero; and so on. The movement between each credit rating is allowed to have a different effect, so
using (7.12) is much more flexible than simply putting CR in as a single variable. Once the dummy
variables are defined, estimating (7.12) is straightforward.

GOING FURTHER 7.3
In model (7.12), how would you test the null hypothesis that credit rating has no effect on MBR?

Equation (7.12) contains the model with a constant partial effect as a special case. One
way to write the three restrictions that imply a constant partial effect is $\delta_2 = 2\delta_1$,
$\delta_3 = 3\delta_1$, and $\delta_4 = 4\delta_1$. When we plug these into equation (7.12) and rearrange, we get
$MBR = \beta_0 + \delta_1(CR_1 + 2CR_2 + 3CR_3 + 4CR_4) + \text{other factors}$. Now, the term multiplying
$\delta_1$ is simply the original credit rating variable, CR. To obtain the F statistic for testing the
constant partial effect restrictions, we obtain the unrestricted R-squared from (7.12) and the
restricted R-squared from the regression of MBR on CR and the other factors we have controlled for.
The F statistic is obtained as in equation (4.41) with q = 3.
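The test of the constant partial effect restrictions can be sketched as follows. The data are hypothetical (ten made-up municipalities), and the sketch assumes there are no "other factors", so that the unrestricted model's fitted values are simply the mean of MBR within each rating and the restricted model is a simple regression of MBR on CR. The F statistic uses the R-squared form with q = 3 restrictions.

```python
# Hypothetical data, made up for illustration: credit rating (0-4) and bond rate (%).
cr = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
mbr = [6.0, 5.8, 5.1, 4.9, 4.6, 4.8, 4.5, 4.3, 4.2, 4.4]
n = len(mbr)

# Dummy variables CR_1..CR_4; a rating of zero is the omitted base group.
dummies = {j: [1 if c == j else 0 for c in cr] for j in range(1, 5)}
# The original rating is recovered as CR = CR_1 + 2*CR_2 + 3*CR_3 + 4*CR_4.
rebuilt = [sum(j * dummies[j][i] for j in dummies) for i in range(n)]

ybar = sum(mbr) / n
sst = sum((y - ybar) ** 2 for y in mbr)

# Unrestricted model (7.12) with no other factors: fitted values are rating-group means.
group_mean = {c: sum(y for y, ci in zip(mbr, cr) if ci == c) / cr.count(c)
              for c in set(cr)}
ssr_ur = sum((y - group_mean[c]) ** 2 for y, c in zip(mbr, cr))

# Restricted model: simple regression of MBR on CR.
xbar = sum(cr) / n
slope = (sum((c - xbar) * (y - ybar) for c, y in zip(cr, mbr))
         / sum((c - xbar) ** 2 for c in cr))
intercept = ybar - slope * xbar
ssr_r = sum((y - intercept - slope * c) ** 2 for y, c in zip(mbr, cr))

r2_ur, r2_r = 1 - ssr_ur / sst, 1 - ssr_r / sst
q, df_ur = 3, n - 5  # three restrictions; five parameters in the unrestricted model
F = ((r2_ur - r2_r) / q) / ((1 - r2_ur) / df_ur)
print(round(r2_r, 3), round(r2_ur, 3), round(F, 2))
```

With other factors in the model, both regressions would include them and the degrees of freedom would fall accordingly; the structure of the test is unchanged.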


EXAMPLE 7.7 Effects of Physical Attractiveness on Wage


Hamermesh and Biddle (1994) used measures of physical attractiveness in a wage equation. (The 
file BEAUTY contains fewer variables but more observations than used by Hamermesh and Biddle. 
See Computer Exercise C12.) Each person in the sample was ranked by an interviewer for physi- 
cal attractiveness, using five categories (homely, quite plain, average, good looking, and strikingly 
beautiful or handsome). Because there are so few people at the two extremes, the authors put people 
into one of three groups for the regression analysis: average, below average, and above average, 
where the base group is average. Using data from the 1977 Quality of Employment Survey, after 
controlling for the usual productivity characteristics, Hamermesh and Biddle estimated an equation 
for men:


$$\widehat{\log(wage)} = \hat{\beta}_0 - \underset{(.046)}{.164}\,\mathit{belavg} + \underset{(.033)}{.016}\,\mathit{abvavg} + \text{other factors}$$
$$n = 700,\ R^2 = .403$$


and an equation for women:


$$\widehat{\log(wage)} = \hat{\beta}_0 - \underset{(.066)}{.124}\,\mathit{belavg} + \underset{(.049)}{.035}\,\mathit{abvavg} + \text{other factors}$$
$$n = 409,\ R^2 = .330.$$


The other factors controlled for in the regressions include education, experience, tenure, marital 
status, and race; see Table 3 in Hamermesh and Biddle’s paper for a more complete list. In order 
to save space, the coefficients on the other variables are not reported in the paper and neither is the 
intercept. 

For men, those with below average looks are estimated to earn about 16.4% less than an average-looking
man who is the same in other respects (including education, experience, tenure, marital status,
and race). The effect is statistically different from zero, with t = −3.57. Men with above average
looks are estimated to earn only 1.6% more than men with average looks, and the effect is not
statistically significant (t < .5).

A woman with below average looks earns about 12.4% less than an otherwise comparable
average-looking woman, with t = −1.88. As was the case for men, the estimate on abvavg is much
smaller in magnitude and not statistically different from zero.

In related work, Biddle and Hamermesh (1998) revisit the effects of looks on earnings using a 
more homogeneous group: graduates of a particular law school. The authors continue to find that 
physical appearance has an effect on annual earnings, something that is perhaps not too surprising 
among people practicing law. 


In some cases, the ordinal variable takes on too many values so that a dummy variable cannot be 
included for each value. For example, the file LAWSCH85 contains data on median starting salaries 
for law school graduates. One of the key explanatory variables is the rank of the law school. Because 
each law school has a different rank, we clearly cannot include a dummy variable for each rank. If we 
do not wish to put the rank directly in the equation, we can break it down into categories. The
following example shows how this is done.

EXAMPLE 7.8 Effects of Law School Rankings on Starting Salaries


Define the dummy variables top10, r11_25, r26_40, r41_60, r61_100 to take on the value unity when 
the variable rank falls into the appropriate range. We let schools ranked below 100 be the base group. 
The estimated equation is 
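
$$\begin{aligned}
\widehat{\log(salary)} = {} & \underset{(.41)}{9.17} + \underset{(.053)}{.700}\,\mathit{top10} + \underset{(.039)}{.594}\,\mathit{r11\_25} + \underset{(.034)}{.375}\,\mathit{r26\_40} \\
& + \underset{(.028)}{.263}\,\mathit{r41\_60} + \underset{(.021)}{.132}\,\mathit{r61\_100} + \underset{(.0031)}{.0057}\,\mathit{LSAT} \\
& + \underset{(.074)}{.041}\,\mathit{GPA} + \underset{(.026)}{.036}\,\log(\mathit{libvol}) + \underset{(.0251)}{.0008}\,\log(\mathit{cost})
\end{aligned} \qquad [7.13]$$
$$n = 136,\ R^2 = .911,\ \bar{R}^2 = .905.$$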


We see immediately that all of the dummy variables defining the different ranks are very statistically
significant. The estimate on r61_100 means that, holding LSAT, GPA, libvol, and cost fixed, the
median salary at a law school ranked between 61 and 100 is about 13.2% higher than that at a law
school ranked below 100. The difference between a top 10 school and a below 100 school is quite
large. Using the exact calculation given in equation (7.10) gives exp(.700) − 1 ≈ 1.014, and so the
predicted median salary is more than 100% higher at a top 10 school than it is at a below 100 school.
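The difference between the approximate and exact percentage effects can be checked directly. The sketch below applies the exact calculation, 100·[exp(coefficient) − 1], to two of the rank coefficients in (7.13); the helper name `exact_pct_diff` is hypothetical.

```python
import math

def exact_pct_diff(coef):
    """Exact percentage difference implied by a dummy coefficient in a log(y) model."""
    return 100 * (math.exp(coef) - 1)

# Coefficients on two of the rank dummies in (7.13).
for name, b in [("top10", 0.700), ("r61_100", 0.132)]:
    print(name, "approx %:", round(100 * b, 1), "exact %:", round(exact_pct_diff(b), 1))
```

For a small coefficient such as .132 the two calculations are close; for a large one such as .700 the approximation understates the difference considerably.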

As an indication of whether breaking the rank into different groups is an improvement, we can 
compare the adjusted R-squared in (7.13) with the adjusted R-squared from including rank as a single 
variable: the former is .905 and the latter is .836, so the additional flexibility of (7.13) is warranted. 

Interestingly, once the rank is put into the (admittedly somewhat arbitrary) given categories, all of 
the other variables become insignificant. In fact, a test for joint significance of LSAT, GPA, log(libvol), 
and log(cost) gives a p-value of .055, which is borderline significant. When rank is included in its 
original form, the p-value for joint significance is zero to four decimal places. 

One final comment about this example: In deriving the properties of ordinary least squares, we 
assumed that we had a random sample. The current application violates that assumption because of the 
way rank is defined: a school’s rank necessarily depends on the rank of the other schools in the sample, 
and so the data cannot represent independent draws from the population of all law schools. This does 
not cause any serious problems provided the error term is uncorrelated with the explanatory variables. 


7-4 Interactions Involving Dummy Variables 


7-4a Interactions among Dummy Variables 


Just as variables with quantitative meaning can be interacted in regression models, so can dummy 
variables. We have effectively seen an example of this in Example 7.6, where we defined four catego- 
ries based on marital status and gender. In fact, we can recast that model by adding an interaction 
term between female and married to the model where female and married appear separately. This 
allows the marriage premium to depend on gender, just as it did in equation (7.11). For purposes of 
comparison, the estimated model with the female-married interaction term is 


$$\begin{aligned}
\widehat{\log(wage)} = {} & \underset{(.100)}{.321} - \underset{(.056)}{.110}\,\mathit{female} + \underset{(.055)}{.213}\,\mathit{married} \\
& - \underset{(.072)}{.301}\,\mathit{female}\cdot\mathit{married} + \cdots,
\end{aligned} \qquad [7.14]$$

where the rest of the regression is necessarily identical to (7.11). Equation (7.14) shows explicitly
that there is a statistically significant interaction between gender and marital status. This model also
allows us to obtain the estimated wage differential among all four groups, but here we must be careful
to plug in the correct combination of zeros and ones.

Setting female = 0 and married = 0 corresponds to the group single men, which is the base
group, as this eliminates female, married, and female·married. We can find the intercept for married
men by setting female = 0 and married = 1 in (7.14); this gives an intercept of .321 + .213 = .534,
and so on.

Equation (7.14) is just a different way of finding wage differentials across all gender-marital
status combinations. It allows us to easily test the null hypothesis that the gender differential does
not depend on marital status (equivalently, that the marriage differential does not depend on gender). 
Equation (7.11) is more convenient for testing for wage differentials between any group and the base 
group of single men. 
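Plugging the four combinations of zeros and ones into (7.14) can be sketched as below. The coefficient on married is taken to be .213, the value used in the text's own calculation .321 + .213 = .534; the dictionary keys and the helper name `intercept` are hypothetical.

```python
# Coefficients from (7.14); .213 is the coefficient on married used in the text's calculation.
b = {"const": 0.321, "female": -0.110, "married": 0.213, "female_married": -0.301}

def intercept(female, married):
    """Group intercept implied by (7.14) for given 0/1 values of the two dummies."""
    return (b["const"] + b["female"] * female + b["married"] * married
            + b["female_married"] * female * married)

groups = {"single men": (0, 0), "married men": (0, 1),
          "single women": (1, 0), "married women": (1, 1)}
intercepts = {g: round(intercept(f, m), 3) for g, (f, m) in groups.items()}
# Differentials relative to the base group (single men).
diffs = {g: round(v - intercepts["single men"], 3) for g, v in intercepts.items()}
print(intercepts)
print(diffs)
```

The differentials relative to single men match the dummy variable coefficients discussed for (7.11), which is why the two equations are equivalent ways of describing the same four groups.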


EXAMPLE 7.9 Effects of Computer Usage on Wages


Krueger (1993) estimates the effects of computer usage on wages. He defines a dummy variable, 
which we call compwork, equal to one if an individual uses a computer at work. Another dummy 
variable, comphome, equals one if the person uses a computer at home. Using 13,379 people from the 
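1989 Current Population Survey, Krueger (1993, Table 4) obtains


$$\begin{aligned}
\widehat{\log(wage)} = {} & \hat{\beta}_0 + \underset{(.009)}{.177}\,\mathit{compwork} + \underset{(.019)}{.070}\,\mathit{comphome} \\
& + \underset{(.023)}{.017}\,\mathit{compwork}\cdot\mathit{comphome} + \text{other factors}.
\end{aligned} \qquad [7.15]$$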


(The other factors are the standard ones for wage regressions, including education, experience, gender, 
and marital status; see Krueger’s paper for the exact list.) Krueger does not report the intercept because 
it is not of any importance; all we need to know is that the base group consists of people who do not 
use a computer at home or at work. It is worth noticing that the estimated return to using a computer at 
work (but not at home) is about 17.7%. (The more precise estimate is 19.4%.) Similarly, people who 
use computers at home but not at work have about a 7% wage premium over those who do not use a 
computer at all. The differential between those who use a computer at both places, relative to those 
who use a computer in neither place, is about 26.4% (obtained by adding all three coefficients and mul- 
tiplying by 100), or the more precise estimate 30.2% obtained from equation (7.10). 

The interaction term in (7.15) is not statistically significant, nor is it very big economically. But it 
is causing little harm by being in the equation. 
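The percentages quoted for (7.15) follow from simple arithmetic, sketched here. For someone who uses a computer in both places, all three terms are switched on, so the coefficients are summed before the exact calculation from (7.10) is applied. The variable names are illustrative.

```python
import math

# Coefficients reported in (7.15).
compwork, comphome, both = 0.177, 0.070, 0.017

# Using a computer in both places switches on all three terms.
log_diff_both = compwork + comphome + both
approx_pct = 100 * log_diff_both
exact_pct = 100 * (math.exp(log_diff_both) - 1)
exact_work_only = 100 * (math.exp(compwork) - 1)
print(round(approx_pct, 1), round(exact_pct, 1), round(exact_work_only, 1))
```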


7-4b Allowing for Different Slopes 


We have now seen several examples of how to allow different intercepts for any number of groups 
in a multiple regression model. There are also occasions for interacting dummy variables with 
explanatory variables that are not dummy variables to allow for a difference in slopes. Continuing
with the wage example, suppose that we wish to test whether the return to education is the same for 
men and women, allowing for a constant wage differential between men and women (a differential 
for which we have already found evidence). For simplicity, we include only education and gender 
in the model. What kind of model allows for different returns to education? Consider the model 


$$\log(wage) = (\beta_0 + \delta_0\,\mathit{female}) + (\beta_1 + \delta_1\,\mathit{female})\,\mathit{educ} + u. \qquad [7.16]$$

If we plug female = 0 into (7.16), then we find that the intercept for males is $\beta_0$, and the slope on
education for males is $\beta_1$. For females, we plug in female = 1; thus, the intercept for females is
$\beta_0 + \delta_0$, and the slope is $\beta_1 + \delta_1$. Therefore, $\delta_0$ measures the difference in intercepts between women
and men, and $\delta_1$ measures the difference in the return to education between women and men. Two of
the four cases for the signs of $\delta_0$ and $\delta_1$ are presented in Figure 7.2.

Graph (a) shows the case where the intercept for women is below that for men, and the slope of the line 
is smaller for women than for men. This means that women earn less than men at all levels of education, 
and the gap increases as educ gets larger. In graph (b), the intercept for women is below that for men, but the 
slope on education is larger for women. This means that women earn less than men at low levels of educa- 
tion, but the gap narrows as education increases. At some point, a woman earns more than a man with the 
same level of education, and this amount of education is easily found once we have the estimated equation. 

How can we estimate model (7.16)? To apply OLS, we must write the model with an interaction 
between female and educ: 


$$\log(wage) = \beta_0 + \delta_0\,\mathit{female} + \beta_1\,\mathit{educ} + \delta_1\,\mathit{female}\cdot\mathit{educ} + u. \qquad [7.17]$$


The parameters can now be estimated from the regression of log(wage) on female, educ, and
female·educ. Obtaining the interaction term is easy in any regression package. Do not be daunted by
the odd nature of female·educ, which is zero for any man in the sample and equal to the level of
education for any woman in the sample.
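Constructing the interaction and running the regression can be sketched without any regression package. The data below are made up for illustration, and `ols` is a small hypothetical helper that solves the normal equations by Gauss-Jordan elimination; in practice one would use a regression package rather than this helper.

```python
def ols(X, y):
    """Solve the normal equations (X'X)b = X'y by Gauss-Jordan elimination."""
    k = len(X[0])
    # Build the augmented matrix [X'X | X'y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))] for i in range(k)]
    for col in range(k):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]

# Hypothetical data (made up): female dummy, years of education, log wage.
female = [0, 0, 0, 0, 1, 1, 1, 1]
educ = [8, 12, 14, 16, 8, 12, 14, 16]
lwage = [1.10, 1.50, 1.62, 1.85, 0.95, 1.25, 1.40, 1.52]

# The interaction is zero for every man and equals educ for every woman.
female_educ = [f * e for f, e in zip(female, educ)]

# Columns: constant, female, educ, female*educ, as in (7.17).
X = [[1, f, e, fe] for f, e, fe in zip(female, educ, female_educ)]
b0, d0, b1, d1 = ols(X, lwage)
print(round(b0, 4), round(d0, 4), round(b1, 4), round(d1, 4))
```

Here `b1` is the slope on education for men and `b1 + d1` is the slope for women, the same slopes that separate simple regressions for each group would give.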

An important hypothesis is that the return to education is the same for women and men. In terms
of model (7.17), this is stated as $H_0\colon \delta_1 = 0$, which means that the slope of log(wage) with respect to
educ is the same for men and women. Note that this hypothesis puts no restrictions on the difference
in intercepts, $\delta_0$. A wage differential between men and women is allowed under this null, but it must
be the same at all levels of education. This situation is described by Figure 7.1.

We are also interested in the hypothesis that average wages are identical for men and women who
have the same levels of education. This means that $\delta_0$ and $\delta_1$ must both be zero under the null
hypothesis. In equation (7.17), we must use an F test to test $H_0\colon \delta_0 = 0, \delta_1 = 0$. In the model with just an
intercept difference, we reject this hypothesis because $H_0\colon \delta_0 = 0$ is soundly rejected against $H_1\colon \delta_0 < 0$.


FIGURE 7.2 Graphs of equation (7.16): (a) $\delta_0 < 0$, $\delta_1 < 0$; (b) $\delta_0 < 0$, $\delta_1 > 0$.

EXAMPLE 7.10 Log Hourly Wage Equation

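We add quadratics in experience and tenure to (7.17):


$$\begin{aligned}
\widehat{\log(wage)} = {} & \underset{(.119)}{.389} - \underset{(.168)}{.227}\,\mathit{female} + \underset{(.008)}{.082}\,\mathit{educ} \\
& - \underset{(.0131)}{.0056}\,\mathit{female}\cdot\mathit{educ} + \underset{(.005)}{.029}\,\mathit{exper} - \underset{(.00011)}{.00058}\,\mathit{exper}^2 \\
& + \underset{(.007)}{.032}\,\mathit{tenure} - \underset{(.00024)}{.00059}\,\mathit{tenure}^2
\end{aligned} \qquad [7.18]$$
$$n = 526,\ R^2 = .441.$$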


The estimated return to education for men in this equation is .082, or 8.2%. For women, it is
.082 − .0056 = .0764, or about 7.6%. The difference, −.56%, or just over one-half a percentage
point less for women, is not economically large nor statistically significant: the t statistic is
−.0056/.0131 ≈ −.43. Thus, we conclude that there is no evidence against the hypothesis that the
return to education is the same for men and women.

The coefficient on female, while remaining economically large, is no longer significant at
conventional levels (t = −1.35). Its coefficient and t statistic in the equation without the interaction were
−.297 and −8.25, respectively [see equation (7.9)]. Should we now conclude that there is no
statistically significant evidence of lower pay for women at the same levels of educ, exper, and tenure? This
would be a serious error. Because we have added the interaction female·educ to the equation, the
coefficient on female is now estimated much less precisely than it was in equation (7.9): the standard error
has increased almost fivefold (.168/.036 ≈ 4.67). This occurs because female and female·educ
are highly correlated in the sample. In this example, there is a useful way to think about the
multicollinearity: in equation (7.17) and the more general equation estimated in (7.18), $\delta_0$ measures the wage
differential between women and men when educ = 0. Very few people in the sample have very low
levels of education, so it is not surprising that we have a difficult time estimating the differential at
educ = 0 (nor is the differential at zero years of education very informative). More interesting would
be to estimate the gender differential at, say, the average education level in the sample (about 12.5).
To do this, we would replace female·educ with female·(educ − 12.5) and rerun the regression; this
only changes the coefficient on female and its standard error. (See Computer Exercise C7.)
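The point estimate of the differential at the average education level can be previewed from the reported coefficients alone, since the coefficient on female after centering equals the original coefficient on female plus 12.5 times the coefficient on the interaction. The sketch below does only this arithmetic with the rounded estimates from (7.18); the helper name is hypothetical, and the standard error cannot be obtained this way, which is why the regression must be rerun.

```python
# Reported (rounded) estimates from (7.18).
coef_female = -0.227        # wage differential at educ = 0
coef_female_educ = -0.0056  # change in the differential per year of education

def gender_differential(educ):
    """Estimated log-wage differential (women minus men) at a given education level."""
    return coef_female + coef_female_educ * educ

at_zero = gender_differential(0)
at_mean = gender_differential(12.5)
print(round(at_zero, 3), round(at_mean, 3))
```

At educ = 12.5 the implied differential is close to the −.297 reported for the model without the interaction, so the interaction changes the precision of the estimate at educ = 0 far more than it changes the differential for a typical worker.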

If we compute the F statistic for $H_0\colon \delta_0 = 0, \delta_1 = 0$, we obtain F = 34.33, which is a huge value
for an F random variable with numerator df = 2 and denominator df = 518: the p-value is zero to
four decimal places. In the end, we prefer model (7.9), which allows for a constant wage differential
between women and men.


GOING FURTHER 7.4
How would you augment the model estimated in (7.18) to allow the return to tenure to differ by
gender?

As a more complicated example involving interactions, we now look at the effects of race and city
racial composition on major league baseball player salaries.


EXAMPLE 7.11 Effects of Race on Baseball Player Salaries


Using MLB1, the following equation is estimated for the 330 major league baseball players for which
city racial composition statistics are available. The variables black and hispan are binary indicators
for the individual players. (The base group is white players.) The variable percblck is the percentage
of the team's city that is black, and perchisp is the percentage of Hispanics. The other variables
measure aspects of player productivity and longevity. Here, we are interested in race effects after
controlling for these other factors.

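In addition to including black and hispan in the equation, we add the interactions black·percblck
and hispan·perchisp. The estimated equation is


$$\begin{aligned}
\widehat{\log(salary)} = {} & \underset{(2.18)}{10.34} + \underset{(.0129)}{.0673}\,\mathit{years} + \underset{(.0034)}{.0089}\,\mathit{gamesyr} \\
& + \underset{(.00151)}{.00095}\,\mathit{bavg} + \underset{(.0164)}{.0146}\,\mathit{hrunsyr} + \underset{(.0076)}{.0045}\,\mathit{rbisyr} \\
& + \underset{(.0046)}{.0072}\,\mathit{runsyr} + \underset{(.0021)}{.0011}\,\mathit{fldperc} + \underset{(.0029)}{.0075}\,\mathit{allstar} \\
& - \underset{(.125)}{.198}\,\mathit{black} - \underset{(.153)}{.190}\,\mathit{hispan} + \underset{(.0050)}{.0125}\,\mathit{black}\cdot\mathit{percblck} \\
& + \underset{(.0098)}{.0201}\,\mathit{hispan}\cdot\mathit{perchisp}
\end{aligned} \qquad [7.19]$$
$$n = 330,\ R^2 = .638.$$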


First, we should test whether the four race variables, black, hispan, black·percblck, and hispan·perchisp,
are jointly significant. Using the same 330 players, the R-squared when the four race variables are
dropped is .626. Because there are four restrictions and df = 330 − 13 in the unrestricted model, the
F statistic is about 2.63, which yields a p-value of .034. Thus, these variables are jointly significant at
the 5% level (though not at the 1% level).
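The F statistic quoted here can be reproduced from the two R-squareds using the R-squared form of the F statistic. The sketch below does only that arithmetic; the function name `f_stat_r2` is hypothetical, and the p-value is not computed because that would require the F distribution from a statistics library.

```python
def f_stat_r2(r2_ur, r2_r, q, df_ur):
    """R-squared form of the F statistic for q exclusion restrictions."""
    return ((r2_ur - r2_r) / q) / ((1 - r2_ur) / df_ur)

# Values reported for (7.19): 13 estimated parameters, 4 race variables dropped under the null.
F = f_stat_r2(r2_ur=0.638, r2_r=0.626, q=4, df_ur=330 - 13)
print(round(F, 2))
```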

How do we interpret the coefficients on the race variables? In the following discussion, all
productivity factors are held fixed. First, consider what happens for black players, holding perchisp fixed.
The coefficient −.198 on black literally means that, if a black player is in a city with no blacks
(percblck = 0), then the black player earns about 19.8% less than a comparable white player. As
percblck increases (which means the white population decreases, because perchisp is held fixed), the
salary of blacks increases relative to that for whites. In a city with 10% blacks, log(salary) for blacks
compared to that for whites is −.198 + .0125(10) = −.073, so salary is about 7.3% less for blacks
than for whites in such a city. When percblck = 20, blacks earn about 5.2% more than whites. The
largest percentage of blacks in a city is about 74% (Detroit).

Similarly, Hispanics earn less than whites in cities with a low percentage of Hispanics. But we
can easily find the value of perchisp that makes the differential between whites and Hispanics equal
zero: it must make −.190 + .0201 perchisp = 0, which gives perchisp ≈ 9.45. For cities in which
the percentage of Hispanics is less than 9.45%, Hispanics are predicted to earn less than whites (for
a given black population), and the opposite is true if the percentage of Hispanics is above 9.45%.
Twelve of the 22 cities represented in the sample have Hispanic populations that are less than 9.45%
of the total population. The largest percentage of Hispanics is about 31%.
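The differentials implied by (7.19) at various city compositions, and the break-even compositions, can be tabulated with a short sketch. The function names are hypothetical, and the break-even value for black players is simply implied by the reported coefficients rather than quoted in the text.

```python
def black_white_diff(percblck):
    """Estimated log-salary differential (black minus white) from (7.19)."""
    return -0.198 + 0.0125 * percblck

def hispan_white_diff(perchisp):
    """Estimated log-salary differential (Hispanic minus white) from (7.19)."""
    return -0.190 + 0.0201 * perchisp

for pct in (0, 10, 20):
    print(pct, round(black_white_diff(pct), 3))

# City composition at which each estimated differential is zero.
breakeven_black = 0.198 / 0.0125
breakeven_hispan = 0.190 / 0.0201
print(round(breakeven_black, 2), round(breakeven_hispan, 2))
```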

How do we interpret these findings? We cannot simply claim discrimination exists against 
blacks and Hispanics, because the estimates imply that whites earn less than blacks and Hispanics 
in cities heavily populated by minorities. The importance of city composition on salaries might be 
due to player preferences: perhaps the best black players live disproportionately in cities with more 
blacks and the best Hispanic players tend to be in cities with more Hispanics. The estimates in (7.19) 
allow us to determine that some relationship is present, but we cannot distinguish between these two 
hypotheses.


7-4c Testing for Differences in Regression Functions across Groups


The previous examples illustrate that interacting dummy variables with other independent variables 
can be a powerful tool. Sometimes, we wish to test the null hypothesis that two populations or groups 
follow the same regression function, against the alternative that one or more of the slopes differ across 
the groups. We will also see examples of this in Chapter 13, when we discuss pooling different cross 
sections over time. 

Suppose we want to test whether the same regression model describes college grade point aver- 
ages for male and female college athletes. The equation is 


$$\mathit{cumgpa} = \beta_0 + \beta_1\,\mathit{sat} + \beta_2\,\mathit{hsperc} + \beta_3\,\mathit{tothrs} + u,$$


where sat is SAT score, hsperc is high school rank percentile, and tothrs is total hours of college 
courses. We know that, to allow for an intercept difference, we can include a dummy variable for 
either males or females. If we want any of the slopes to depend on gender, we simply interact the 
appropriate variable with, say, female, and include it in the equation. 

If we are interested in testing whether there is any difference between men and women, then we 
must allow a model where the intercept and all slopes can be different across the two groups: 


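$$\begin{aligned}
\mathit{cumgpa} = {} & \beta_0 + \delta_0\,\mathit{female} + \beta_1\,\mathit{sat} + \delta_1\,\mathit{female}\cdot\mathit{sat} + \beta_2\,\mathit{hsperc} \\
& + \delta_2\,\mathit{female}\cdot\mathit{hsperc} + \beta_3\,\mathit{tothrs} + \delta_3\,\mathit{female}\cdot\mathit{tothrs} + u.
\end{aligned} \qquad [7.20]$$

The parameter $\delta_0$ is the difference in the intercept between women and men, $\delta_1$ is the slope difference
with respect to sat between women and men, and so on. The null hypothesis that cumgpa follows the
same model for males and females is stated as


$$H_0\colon \delta_0 = 0,\ \delta_1 = 0,\ \delta_2 = 0,\ \delta_3 = 0. \qquad [7.21]$$


If one of the $\delta_j$ is different from zero, then the model is different for men and women.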
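Using the spring semester data from the file GPA3, the full model is estimated as


$$\begin{aligned}
\widehat{\mathit{cumgpa}} = {} & \underset{(0.21)}{1.48} - \underset{(.411)}{.353}\,\mathit{female} + \underset{(.0002)}{.0011}\,\mathit{sat} + \underset{(.00039)}{.00075}\,\mathit{female}\cdot\mathit{sat} \\
& - \underset{(.0014)}{.0085}\,\mathit{hsperc} - \underset{(.00316)}{.00055}\,\mathit{female}\cdot\mathit{hsperc} + \underset{(.0009)}{.0023}\,\mathit{tothrs} \\
& - \underset{(.00163)}{.00012}\,\mathit{female}\cdot\mathit{tothrs}
\end{aligned} \qquad [7.22]$$
$$n = 366,\ R^2 = .406,\ \bar{R}^2 = .394.$$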


None of the four terms involving the female dummy variable is very statistically significant; only the
female·sat interaction has a t statistic close to two. But we know better than to rely on the individual
t statistics for testing a joint hypothesis such as (7.21). To compute the F statistic, we must estimate the
restricted model, which results from dropping female and all of the interactions; this gives an $R^2$ (the
restricted $R^2$) of about .352, so the F statistic is about 8.14; the p-value is zero to five decimal places,
which causes us to soundly reject (7.21). Thus, men and women athletes do follow different GPA
models, even though each term in (7.22) that allows women and men to be different is individually
insignificant at the 5% level.




The large standard errors on female and the interaction terms make it difficult to tell exactly 
how men and women differ. We must be very careful in interpreting equation (7.22) because, in 
obtaining differences between women and men, the interaction terms must be taken into account. 
If we look only at the female variable, we would wrongly conclude that cumgpa is about .353 
less for women than for men, holding other factors fixed. This is the estimated difference only 
when sat, hsperc, and tothrs are all set to zero, which is not close to being a possible scenario. At 
sat = 1,100, hsperc = 10, and tothrs = 50, the predicted difference between a woman and a man
is −.353 + .00075(1,100) − .00055(10) − .00012(50) ≈ .461. That is, the female athlete is
predicted to have a GPA that is almost one-half a point higher than the comparable male athlete.
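The role of the interaction terms in this calculation is easy to see by evaluating the estimated gap at different covariate values. The sketch below uses only the coefficients on female and its interactions from (7.22); the function name `gpa_gap` is hypothetical.

```python
# Coefficients on female and its interactions from (7.22).
d_female, d_sat, d_hsperc, d_tothrs = -0.353, 0.00075, -0.00055, -0.00012

def gpa_gap(sat, hsperc, tothrs):
    """Predicted cumgpa difference (women minus men) at given covariate values."""
    return d_female + d_sat * sat + d_hsperc * hsperc + d_tothrs * tothrs

print(round(gpa_gap(0, 0, 0), 4))       # the coefficient on female alone
print(round(gpa_gap(1100, 10, 50), 4))  # the scenario discussed in the text
```

The sign of the gap flips between the two calls, which is the reason the coefficient on female should not be read in isolation.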

In a model with three variables, sat, hsperc, and tothrs, it is pretty simple to add all of the inter- 
actions to test for group differences. In some cases, many more explanatory variables are involved, 
and then it is convenient to have a different way to compute the statistic. It turns out that the sum of 
squared residuals form of the F statistic can be computed easily even when many independent vari- 
ables are involved. 

In the general model with k explanatory variables and an intercept, suppose we have two groups; 
call them g = 1 and g = 2. We would like to test whether the intercept and all slopes are the same 
across the two groups. Write the model as 


$$y = \beta_{g,0} + \beta_{g,1}x_1 + \beta_{g,2}x_2 + \cdots + \beta_{g,k}x_k + u, \qquad [7.23]$$


for g = 1 and g = 2. The hypothesis that each beta in (7.23) is the same across the two groups involves 
k + 1restrictions (inthe GPA example, k + 1 = 4). The unrestricted model, which we can think of as hav- 
ing a group dummy variable and k interaction terms in addition to the intercept and variables themselves, 
has n — 2(k + 1) degrees of freedom. [In the GPA example, n — 2(k + 1) = 366 — 2(4) = 358.] 
So far, there is nothing new. The key insight is that the sum of squared residuals from the unrestricted 
model can be obtained from two separate regressions, one for each group. Let \(SSR_1\) be the sum of
squared residuals obtained estimating (7.23) for the first group; this involves \(n_1\) observations. Let \(SSR_2\)
be the sum of squared residuals obtained from estimating the model using the second group (\(n_2\)
observations). In the previous example, if group 1 is females, then \(n_1 = 90\) and \(n_2 = 276\). Now, the sum of
squared residuals for the unrestricted model is simply \(SSR_{ur} = SSR_1 + SSR_2\). The restricted sum of
squared residuals is just the SSR from pooling the groups and estimating a single equation, say \(SSR_P\).
Once we have these, we compute the F statistic as usual: 


\[
F = \frac{[SSR_P - (SSR_1 + SSR_2)]}{SSR_1 + SSR_2} \cdot \frac{[n - 2(k + 1)]}{k + 1}, \tag{7.24}
\]


where n is the total number of observations. This particular F statistic is usually called the Chow 
statistic in econometrics. Because the Chow test is just an F test, it is only valid under homoskedas- 
ticity. In particular, under the null hypothesis, the error variances for the two groups must be equal. As 
usual, normality is not needed for asymptotic analysis. 

To apply the Chow statistic to the GPA example, we need the SSR from the regression that pooled 
the groups together: this is \(SSR_P = 85.515\). The SSR for the 90 women in the sample is \(SSR_1 = 19.603\),
and the SSR for the men is \(SSR_2 = 58.752\). Thus, \(SSR_{ur} = 19.603 + 58.752 = 78.355\). The F statistic
is [(85.515 - 78.355)/78.355](358/4) ≈ 8.18; of course, subject to rounding error, this is what
we get using the R-squared form of the test in the models with and without the interaction terms.
(A word of caution: there is no simple R-squared form of the test if separate regressions have been 
estimated for each group; the R-squared form of the test can be used only if interactions have been 
included to create the unrestricted model.) 
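
The calculation in the GPA example can be reproduced directly from the three sums of squared residuals reported above. The following is a minimal sketch; the function name `chow_statistic` and its argument names are illustrative assumptions, not part of any particular software package.

```python
def chow_statistic(ssr_pooled, ssr_1, ssr_2, n, k):
    """Chow F statistic for equality of the intercept and all k slopes
    across two groups, following the form of equation (7.24)."""
    ssr_ur = ssr_1 + ssr_2        # unrestricted SSR from the two separate regressions
    df_denom = n - 2 * (k + 1)    # degrees of freedom in the unrestricted model
    q = k + 1                     # number of restrictions under the null
    return ((ssr_pooled - ssr_ur) / ssr_ur) * (df_denom / q)

# Values reported in the text for the GPA example (k = 3 explanatory variables).
f_gpa = chow_statistic(ssr_pooled=85.515, ssr_1=19.603, ssr_2=58.752, n=366, k=3)
print(round(f_gpa, 2))  # 8.18
```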

One important limitation of the traditional Chow test, regardless of the method used to imple- 
ment it, is that the null hypothesis allows for no differences at all between the groups. In many cases, 
it is more interesting to allow for an intercept difference between the groups and then to test for slope 
differences; we saw one example of this in the wage equation in Example 7.10. There are two ways
to allow the intercepts to differ under the null hypothesis. One is to include the group dummy and all
interaction terms, as in equation (7.22), but then test joint significance of the interaction terms only.
The second approach, which produces an identical statistic, is to form a sum-of-squared-residuals
F statistic, as in equation (7.24), but where the restricted SSR, called “\(SSR_P\)” in equation (7.24), is
obtained using a regression that contains only an intercept shift. Because we are testing k restrictions,
rather than k + 1, the F statistic becomes


\[
F = \frac{[SSR_P - (SSR_1 + SSR_2)]}{SSR_1 + SSR_2} \cdot \frac{[n - 2(k + 1)]}{k}.
\]


Using this approach in the GPA example, \(SSR_P\) is obtained from the regression of cumgpa on female,
sat, hsperc, and tothrs using the data for both male and female student athletes.
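
The claim that the two approaches give an identical statistic can be illustrated numerically. The sketch below uses simulated (hypothetical) data rather than the GPA data, and the helper `ssr` is an illustrative name; it checks that the unrestricted SSR from the fully interacted regression equals the sum of the SSRs from the two separate regressions, and then forms the F statistic for the k slope restrictions.

```python
import numpy as np

def ssr(y, X):
    """Sum of squared OLS residuals from regressing y on the columns of X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

rng = np.random.default_rng(0)
n, k = 200, 2
g = rng.integers(0, 2, size=n)               # group dummy (simulated)
x = rng.normal(size=(n, k))                  # k explanatory variables (simulated)
y = 1.0 + 0.5 * g + x @ np.array([1.0, -0.5]) + rng.normal(size=n)

ones = np.ones((n, 1))
X_shift = np.hstack([ones, g[:, None], x])       # intercept shift only
X_full = np.hstack([X_shift, g[:, None] * x])    # adds the k interaction terms

# Restricted SSR: pooled regression that includes the group dummy.
ssr_p = ssr(y, X_shift)

# Unrestricted SSR, computed two ways.
ssr_ur_interactions = ssr(y, X_full)
ssr_ur_separate = sum(
    ssr(y[g == j], np.hstack([ones[g == j], x[g == j]])) for j in (0, 1)
)

# F statistic for the k slope restrictions, allowing different intercepts.
df = n - 2 * (k + 1)
f_stat = ((ssr_p - ssr_ur_separate) / ssr_ur_separate) * (df / k)
print(f_stat)
```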

Because there are relatively few explanatory variables in the GPA example, it is easy to estimate 
(7.20) and test \(H_0\): \(\delta_1 = 0\), \(\delta_2 = 0\), \(\delta_3 = 0\) (with \(\delta_0\) unrestricted under the null). The F statistic for the
three exclusion restrictions gives a p-value equal to .205, and so we do not reject the null hypothesis 
at even the 20% significance level. 

Failure to reject the hypothesis that the parameters multiplying the interaction terms are all zero 
suggests that the best model allows for an intercept difference only: 


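\[
\begin{aligned}
\widehat{\mathit{cumgpa}} = {} & \underset{(.18)}{1.39} + \underset{(.059)}{.310}\,\mathit{female} + \underset{(.0002)}{.0012}\,\mathit{sat} - \underset{(.0012)}{.0084}\,\mathit{hsperc} \\
& + \underset{(.0007)}{.0025}\,\mathit{tothrs} \\
& n = 366,\ R^2 = .398,\ \bar{R}^2 = .392.
\end{aligned}
\tag{7.25}
\]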


The slope coefficients in (7.25) are close to those for the base group (males) in (7.22); dropping the 
interactions changes very little. However, female in (7.25) is highly significant: its t statistic is over 5,
and the estimate implies that, at given levels of sat, hsperc, and tothrs, a female athlete has a predicted 
GPA that is .31 point higher than that of a male athlete. This is a practically important difference. 


7-5 A Binary Dependent Variable: The Linear Probability Model 


By now, we have learned much about the properties and applicability of the multiple linear regression 
model. In the last several sections, we studied how, through the use of binary independent variables, 
we can incorporate qualitative information as explanatory variables in a multiple regression model. In 
all of the models up until now, the dependent variable y has had quantitative meaning (for example, 
y is a dollar amount, a test score, a percentage, or the logs of these). What happens if we want to use 
multiple regression to explain a qualitative event? 

In the simplest case, and one that often arises in practice, the event we would like to explain is 
a binary outcome. In other words, our dependent variable, y, takes on only two values: zero and one. 
For example, y can be defined to indicate whether an adult has a high school education; y can indicate 
whether a college student used illegal drugs during a given school year; or y can indicate whether a 
firm was taken over by another firm during a given year. In each of these examples, we can let y = 1 
denote one of the outcomes and y = 0 the other outcome. 

What does it mean to write down a multiple regression model, such as 


\[
y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + u, \tag{7.26}
\]
when y is a binary variable? Because y can take on only two values, \(\beta_j\) cannot be interpreted as the
change in y given a one-unit increase in \(x_j\), holding all other factors fixed: y either changes from zero
to one or from one to zero (or does not change). Nevertheless, the \(\beta_j\) still have useful interpretations.
If we assume that the zero conditional mean assumption MLR.4 holds, that is, \(E(u|x_1, \ldots, x_k) = 0\),
then we have, as always,


\[
E(y|x) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k,
\]


where x is shorthand for all of the explanatory variables. 

The key point is that when y is a binary variable taking on the values zero and one, it is always 
true that P(y = 1|x) = E(y|x): the probability of “success”—that is, the probability that y = 1—is 
the same as the expected value of y. Thus, we have the important equation 


\[
P(y = 1|x) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k, \tag{7.27}
\]


which says that the probability of success, say, \(p(x) = P(y = 1|x)\), is a linear function of the \(x_j\).
Equation (7.27) is an example of a binary response model, and \(P(y = 1|x)\) is also called the response
probability. (We will cover other binary response models in Chapter 17.) Because probabilities must
sum to one, \(P(y = 0|x) = 1 - P(y = 1|x)\) is also a linear function of the \(x_j\).
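
The identity between the response probability and the conditional mean, used to obtain (7.27), follows in one line from the definition of an expected value for a variable that takes only the values zero and one:

\[
E(y|x) = 1 \cdot P(y = 1|x) + 0 \cdot P(y = 0|x) = P(y = 1|x).
\]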

The multiple linear regression model with a binary dependent variable is called the linear probability
model (LPM) because the response probability is linear in the parameters \(\beta_j\). In the LPM, \(\beta_j\)
measures the change in the probability of success when \(x_j\) changes, holding other factors fixed:


\[
\Delta P(y = 1|x) = \beta_j \Delta x_j. \tag{7.28}
\]


With this in mind, the multiple regression model can allow us to estimate the effect of various explan- 
atory variables on qualitative events. The mechanics of OLS are the same as before. 
If we write the estimated equation as 


\[
\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \cdots + \hat{\beta}_k x_k,
\]


we must now remember that \(\hat{y}\) is the predicted probability of success. Therefore, \(\hat{\beta}_0\) is the predicted
probability of success when each \(x_j\) is set to zero, which may or may not be interesting. The slope
coefficient \(\hat{\beta}_1\) measures the predicted change in the probability of success when \(x_1\) increases by
one unit.

To correctly interpret a linear probability model, we must know what constitutes a “success.” 
Thus, it is a good idea to give the dependent variable a name that describes the event y = 1. As an 
example, let inlf (“in the labor force”) be a binary variable indicating labor force participation by a
married woman during 1975: inlf = 1 if the woman reports working for a wage outside the home at
some point during the year, and zero otherwise. We assume that labor force participation depends on
other sources of income, including husband’s earnings (nwifeinc, measured in thousands of dollars),
years of education (educ), past years of labor market experience (exper), age, number of children less
than six years old (kidslt6), and number of kids between 6 and 18 years of age (kidsge6). Using the
data in MROZ from Mroz (1987), we estimate the following linear probability model, where 428 of
the 753 women in the sample report being in the labor force at some point during 1975:


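\[
\begin{aligned}
\widehat{\mathit{inlf}} = {} & \underset{(.154)}{.586} - \underset{(.0014)}{.0034}\,\mathit{nwifeinc} + \underset{(.007)}{.038}\,\mathit{educ} + \underset{(.006)}{.039}\,\mathit{exper} \\
& - \underset{(.00018)}{.00060}\,\mathit{exper}^2 - \underset{(.002)}{.016}\,\mathit{age} - \underset{(.034)}{.262}\,\mathit{kidslt6} + \underset{(.013)}{.013}\,\mathit{kidsge6} \\
& n = 753,\ R^2 = .264.
\end{aligned}
\tag{7.29}
\]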
Using the usual t statistics, all variables in (7.29) except kidsge6 are statistically significant, and all
of the significant variables have the effects we would expect based on economic theory (or common 
sense). 

To interpret the estimates, we must remember that a change in the independent variable changes 
the probability that inlf = 1. For example, the coefficient on educ means that, everything else in
(7.29) held fixed, another year of education increases the probability of labor force participation by 
.038. If we take this equation literally, 10 more years of education increases the probability of being 
in the labor force by .038(10) = .38, which is a pretty large increase in a probability. The relation- 
ship between the probability of labor force participation and educ is plotted in Figure 7.3. The other 
independent variables are fixed at the values nwifeinc = 50, exper = 5, age = 30, kidslt6 = 1, and 
kidsge6 = 0 for illustration purposes. The predicted probability is negative until education equals 
3.84 years. This should not cause too much concern because, in this sample, no woman has less 
than five years of education. The largest reported education is 17 years, and this leads to a predicted 
probability of .5. If we set the other independent variables at different values, the range of predicted 
probabilities would change. But the marginal effect of another year of education on the probability of 
labor force participation is always .038. 
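
The numbers in this paragraph follow mechanically from (7.29). As a rough check, the sketch below plugs the rounded coefficients into a small helper; the names `COEF` and `predicted_inlf` are illustrative assumptions.

```python
# Rounded OLS coefficients as reported in (7.29).
COEF = {
    "const": 0.586, "nwifeinc": -0.0034, "educ": 0.038, "exper": 0.039,
    "expersq": -0.00060, "age": -0.016, "kidslt6": -0.262, "kidsge6": 0.013,
}

def predicted_inlf(nwifeinc, educ, exper, age, kidslt6, kidsge6):
    """Predicted probability of labor force participation from the LPM."""
    return (COEF["const"] + COEF["nwifeinc"] * nwifeinc + COEF["educ"] * educ
            + COEF["exper"] * exper + COEF["expersq"] * exper ** 2
            + COEF["age"] * age + COEF["kidslt6"] * kidslt6
            + COEF["kidsge6"] * kidsge6)

# Values at which the other variables are held fixed in Figure 7.3.
fixed = dict(nwifeinc=50, exper=5, age=30, kidslt6=1, kidsge6=0)

p_at_17 = predicted_inlf(educ=17, **fixed)                   # about .5
educ_zero = -predicted_inlf(educ=0, **fixed) / COEF["educ"]  # about 3.84 years
slope = predicted_inlf(educ=13, **fixed) - predicted_inlf(educ=12, **fixed)

print(p_at_17, educ_zero, slope)
```

Changing the entries of `fixed` shifts the whole line up or down, but `slope` stays at .038, which is the point made in the text.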

The coefficient on nwifeinc implies that, if \(\Delta \mathit{nwifeinc} = 10\) (which means an increase of $10,000),
the probability that a woman is in the labor force falls by .034. This is not an especially large effect
given that an increase in income of $10,000 is substantial in terms of 1975 dollars. Experience has
been entered as a quadratic to allow past experience to have a diminishing effect on the
labor force participation probability. Holding other factors fixed, the estimated change in the probability
is approximated as .039 - 2(.0006)exper = .039 - .0012 exper. The point at which past experience
has no effect on the probability of labor force participation is .039/.0012 = 32.5, which is a high
level of experience: only 13 of the 753 women in the sample have more than 32 years of experience.
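
The diminishing effect of experience can be tabulated with a few lines of code. This is a sketch using the rounded coefficients from (7.29); the function name `exper_marginal_effect` is an illustrative assumption.

```python
def exper_marginal_effect(exper, b1=0.039, b2=-0.00060):
    """Approximate change in P(inlf = 1) from one more year of experience,
    obtained by differentiating b1*exper + b2*exper**2."""
    return b1 + 2 * b2 * exper

# Level of experience at which the estimated marginal effect reaches zero.
turning_point = 0.039 / (2 * 0.00060)

for years in (0, 10, 20, 30):
    print(years, exper_marginal_effect(years))
print(turning_point)
```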


FIGURE 7.3 Estimated relationship between the probability of being in the labor force and years of education, with other explanatory variables fixed.
Unlike the number of older children, the number of young children has a huge impact on labor
force participation. Having one additional child less than six years old reduces the probability of
participation by .262, at given levels of the other variables. In the sample, just under 20% of the women
have at least one young child.

This example illustrates how easy linear probability models are to estimate and interpret, but 
it also highlights some shortcomings of the LPM. First, it is easy to see that, if we plug certain 
combinations of values for the independent variables into (7.29), we can get predictions either less 
than zero or greater than one. Because these are predicted probabilities, and probabilities must be 
between zero and one, this can be a little embarrassing. For example, what would it mean to predict
that a woman is in the labor force with a probability of -.10? In fact, of the 753 women in the
sample, 16 of the fitted values from (7.29) are less than zero, and 17 of the fitted values are greater 
than one. 
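
Fitted values outside the unit interval are easy to reproduce. The sketch below does not use the MROZ data; it simulates a hypothetical binary outcome whose true response probability is nonlinear in a single regressor, fits a linear probability model by OLS, and counts the fitted values that fall below zero or above one.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
x = rng.normal(loc=0.0, scale=2.0, size=n)         # hypothetical regressor
p_true = 1.0 / (1.0 + np.exp(-2.0 * x))            # true P(y = 1 | x), nonlinear in x
y = (rng.uniform(size=n) < p_true).astype(float)   # simulated binary outcome

# OLS estimation of the linear probability model y = b0 + b1*x + u.
X = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta_hat

n_below_zero = int(np.sum(fitted < 0))
n_above_one = int(np.sum(fitted > 1))
print(n_below_zero, n_above_one)
```

Because the fitted line is unbounded while the true probability is not, observations with extreme values of the regressor are the ones that produce out-of-range predictions.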

A related problem is that a probability cannot be linearly related to the independent variables 
for all their possible values. For example, (7.29) predicts that the effect of going from zero children 
to one young child reduces the probability of working by .262. This is also the predicted drop if the 
woman goes from having one young child to two. It seems more realistic that the first small child 
would reduce the probability by a large amount, but subsequent children would have a smaller mar- 
ginal effect. In fact, when taken to the extreme, (7.29) implies that going from zero to four young 
children reduces the probability of working by \(\Delta \widehat{\mathit{inlf}} = .262(\Delta \mathit{kidslt6}) = .262(4) = 1.048\), which is
impossible. 

Even with these problems, the linear probability model is useful and often applied in economics. 
It usually works well for values of the independent variables that are near the averages in the sample. 
In the labor force participation example, no women in the sample have four young children; in fact, 
only three women have three young children. Over 96% of the women have either no young children 
or one small child, and so we should probably restrict attention to this case when interpreting the 
estimated equation. 

Predicted probabilities outside the unit interval are a little troubling when we want 
to make predictions. Still, there are ways to use the estimated probabilities (even if some are 
negative or greater than one) to predict a zero-one outcome. As before, let \(\hat{y}_i\) denote the fitted
values, which may not be bounded between zero and one. Define a predicted value as
\(\tilde{y}_i = 1\) if \(\hat{y}_i \geq .5\) and \(\tilde{y}_i = 0\) if \(\hat{y}_i < .5\). Now we have a set of predicted values, \(\tilde{y}_i\), \(i = 1, \ldots, n\),
that, like the \(y_i\), are either zero or one. We can use the data on \(y_i\) and \(\tilde{y}_i\) to obtain the frequencies
with which we correctly predict \(y_i = 1\) and \(y_i = 0\), as well as the proportion of overall correct
predictions. The latter measure, when turned into a percentage, is a widely used goodness-of-fit
measure for binary dependent variables: the percent correctly predicted. An example is given in 
Computer Exercise C9(v), and further discussion, in the context of more advanced models, can be 
found in Section 17-1. 
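
The percent correctly predicted takes only a few lines to compute once fitted values are available. The following sketch uses a small made-up data set; the function name `percent_correctly_predicted` and the example arrays are illustrative assumptions.

```python
import numpy as np

def percent_correctly_predicted(y, fitted, threshold=0.5):
    """Percent of outcomes predicted correctly when the prediction is one
    if the fitted value is at least the threshold and zero otherwise."""
    y = np.asarray(y)
    y_tilde = (np.asarray(fitted) >= threshold).astype(int)
    correct = y_tilde == y
    return {
        "overall": 100 * correct.mean(),
        "y=1": 100 * correct[y == 1].mean(),
        "y=0": 100 * correct[y == 0].mean(),
    }

# Made-up outcomes and LPM fitted values (note two lie outside [0, 1]).
y = np.array([1, 0, 1, 1, 0, 0])
fitted = np.array([0.9, 0.2, 0.4, 1.1, -0.1, 0.6])
result = percent_correctly_predicted(y, fitted)
print(result)
```

Reporting the rates separately for y = 1 and y = 0 is useful because a high overall rate can hide poor prediction of the less common outcome.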

Due to the binary nature of y, the linear probability model does violate one of the Gauss-Markov 
assumptions. When y is a binary variable, its variance, conditional on x, is 


\[
\mathrm{Var}(y|x) = p(x)[1 - p(x)], \tag{7.30}
\]


where p(x) is shorthand for the probability of success: \(p(x) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k\). This means
that, except in the case where the probability does not depend on any of the independent variables,
there must be heteroskedasticity in a linear probability model. We know from Chapter 3 that this does
not cause bias in the OLS estimators of the \(\beta_j\). But we also know from Chapters 4 and 5 that
homoskedasticity is crucial for justifying the usual t and F statistics, even in large samples. Because the
standard errors in (7.29) are not generally valid, we should use them with caution. We will show how to
correct the standard errors for heteroskedasticity in Chapter 8. It turns out that, in many applications, 
the usual OLS statistics are not far off, and it is still acceptable in applied work to present a standard 
OLS analysis of a linear probability model. 
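
The variance formula in (7.30) can be verified in one step. Because y takes only the values zero and one, \(y^2 = y\), and so

\[
\mathrm{Var}(y|x) = E(y^2|x) - [E(y|x)]^2 = E(y|x) - [E(y|x)]^2 = p(x) - [p(x)]^2 = p(x)[1 - p(x)].
\]

Since p(x) changes with the explanatory variables whenever any slope parameter is nonzero, so does the conditional variance.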
A Linear Probability Model of Arrests


Let arr86 be a binary variable equal to unity if a man was arrested during 1986, and zero otherwise. 
The population is a group of young men in California born in 1960 or 1961 who have at least one 
arrest prior to 1986. A linear probability model for describing arr86 is 


\[
\mathit{arr86} = \beta_0 + \beta_1 \mathit{pcnv} + \beta_2 \mathit{avgsen} + \beta_3 \mathit{tottime} + \beta_4 \mathit{ptime86} + \beta_5 \mathit{qemp86} + u,
\]
where 


pcnv = the proportion of prior arrests that led to a conviction. 
avgsen = the average sentence served from prior convictions (in months). 
tottime = months spent in prison since age 18 prior to 1986. 
ptime86 = months spent in prison in 1986.
qemp86 = the number of quarters (0 to 4) that the man was legally employed in 1986.


The data we use are in CRIME1, the same data set used for Example 3.5. Here, we use a binary 
dependent variable because only 7.2% of the men in the sample were arrested more than once. About 
27.7% of the men were arrested at least once during 1986. The estimated equation is 


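\[
\begin{aligned}
\widehat{\mathit{arr86}} = {} & \underset{(.017)}{.441} - \underset{(.021)}{.162}\,\mathit{pcnv} + \underset{(.0065)}{.0061}\,\mathit{avgsen} - \underset{(.0050)}{.0023}\,\mathit{tottime} \\
& - \underset{(.005)}{.022}\,\mathit{ptime86} - \underset{(.005)}{.043}\,\mathit{qemp86} \\
& n = 2{,}725,\ R^2 = .0474.
\end{aligned}
\tag{7.31}
\]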


The intercept, .441, is the predicted probability of arrest for someone who has not been convicted 
(and so pcnv and avgsen are both zero), has spent no time in prison since age 18, spent no time in
prison in 1986, and was unemployed during the entire year. The variables avgsen and tottime are 
insignificant both individually and jointly (the F test gives p-value = .347), and avgsen has a coun- 
terintuitive sign if longer sentences are supposed to deter crime. Grogger (1991), using a superset of 
these data and different econometric methods, found that tottime has a statistically significant positive 
effect on arrests and concluded that tottime is a measure of human capital built up in criminal activity. 

Increasing the probability of conviction does lower the probability of arrest, but we must 
be careful when interpreting the magnitude of the coefficient. The variable pcnv is a proportion 
between zero and one; thus, changing pcnv from zero to one essentially means a change from 
no chance of being convicted to being convicted with certainty. Even this large change reduces 
the probability of arrest only by .162; increasing pcnv by .5 decreases the probability of arrest 
by .081. 

The incarcerative effect is given by the coefficient on ptime86. If a man is in prison, he cannot be
arrested. Because ptime86 is measured in months, six more months in prison reduces the probability 
of arrest by .022(6) = .132. Equation (7.31) gives another example of where the linear probability 
model cannot be true over all ranges of the independent variables. If a man is in prison all 12 months 
of 1986, he cannot be arrested in 1986. Setting all other variables equal to zero, the predicted
probability of arrest when ptime86 = 12 is .441 - .022(12) = .177, which is not zero. Nevertheless, if we
start from the unconditional probability of arrest, .277, 12 months in prison reduces the probability to
essentially zero: .277 - .022(12) = .013.

Finally, employment reduces the probability of arrest in a significant way. All other factors fixed, a 
man employed in all four quarters is .172 less likely to be arrested than a man who is not employed at all. 
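
The effects discussed in this example are simple multiples of the coefficients in (7.31). The sketch below collects them in one place; `COEF` and `predicted_arr86` are illustrative names, and the coefficients are the rounded values reported in the text.

```python
# Rounded OLS coefficients as reported in (7.31).
COEF = {"const": 0.441, "pcnv": -0.162, "avgsen": 0.0061,
        "tottime": -0.0023, "ptime86": -0.022, "qemp86": -0.043}

def predicted_arr86(pcnv=0.0, avgsen=0.0, tottime=0.0, ptime86=0.0, qemp86=0.0):
    """Predicted probability of arrest in 1986 from the LPM."""
    return (COEF["const"] + COEF["pcnv"] * pcnv + COEF["avgsen"] * avgsen
            + COEF["tottime"] * tottime + COEF["ptime86"] * ptime86
            + COEF["qemp86"] * qemp86)

effect_pcnv_half = COEF["pcnv"] * 0.5                # pcnv rises by .5
effect_six_months = COEF["ptime86"] * 6              # six more months in prison
p_full_year_in_prison = predicted_arr86(ptime86=12)  # other variables at zero
p_from_unconditional = 0.277 + COEF["ptime86"] * 12  # start from sample arrest rate
effect_full_employment = COEF["qemp86"] * 4          # employed all four quarters

print(effect_pcnv_half, effect_six_months, p_full_year_in_prison,
      p_from_unconditional, effect_full_employment)
```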
We can also include dummy independent variables in models with dummy dependent
variables. The coefficient measures the predicted difference in probability relative to the base 
group. For example, if we add two race dummies, black and hispan, to the arrest equation, we 
obtain 


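\[
\begin{aligned}
\widehat{\mathit{arr86}} = {} & \underset{(.019)}{.380} - \underset{(.021)}{.152}\,\mathit{pcnv} + \underset{(.0064)}{.0046}\,\mathit{avgsen} - \underset{(.0049)}{.0026}\,\mathit{tottime} \\
& - \underset{(.005)}{.024}\,\mathit{ptime86} - \underset{(.005)}{.038}\,\mathit{qemp86} + \underset{(.024)}{.170}\,\mathit{black} + \underset{(.021)}{.096}\,\mathit{hispan} \\
& n = 2{,}725,\ R^2 = .0682.
\end{aligned}
\tag{7.32}
\]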


The coefficient on black means that, all other factors being equal, a black man has a .17 higher chance of
being arrested than a white man (the base group). Another way to say this is that the probability of
arrest is 17 percentage points higher for blacks than for whites. The difference is statistically significant
as well. Similarly, Hispanic men have a .096 higher chance of being arrested than white men.

GOING FURTHER 7.5
What is the predicted probability of arrest for a black man with no prior convictions (so that pcnv,
avgsen, tottime, and ptime86 are all zero) who was employed all four quarters in 1986? Does this
seem reasonable?


7-6 More on Policy Analysis and Program Evaluation 


We have seen some examples of models containing dummy variables that can be useful for evaluating 
policy. Example 7.3 gave an example of program evaluation, where some firms received job training 
grants and others did not. 

As we mentioned earlier, we must be careful when evaluating programs because in most exam- 
ples in the social sciences the control and treatment groups are not randomly assigned. Consider again 
the Holzer et al. (1993) study, where we are now interested in the effect of the job training grants on 
worker productivity (as opposed to amount of job training). The equation of interest is 


\[
\log(\mathit{scrap}) = \beta_0 + \beta_1 \mathit{grant} + \beta_2 \log(\mathit{sales}) + \beta_3 \log(\mathit{employ}) + u,
\]


where scrap is the firm’s scrap rate, and the latter two variables are included as controls. The binary 
variable grant indicates whether the firm received a grant in 1988 for job training. 

Before we look at the estimates, we might be worried that the unobserved factors affecting 
worker productivity—such as average levels of education, ability, experience, and tenure—might be 
correlated with whether the firm receives a grant. Holzer et al. point out that grants were given on a 
first-come, first-served basis. But this is not the same as giving out grants randomly. It might be that 
firms with less productive workers saw an opportunity to improve productivity and therefore were 
more diligent in applying for the grants. 

Using the data in JTRAIN for 1988—when firms actually were eligible to receive the grants—we 
obtain 


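\[
\begin{aligned}
\widehat{\log(\mathit{scrap})} = {} & \underset{(4.66)}{4.99} - \underset{(.431)}{.052}\,\mathit{grant} - \underset{(.373)}{.455}\,\log(\mathit{sales}) \\
& + \underset{(.365)}{.639}\,\log(\mathit{employ}) \\
& n = 50,\ R^2 = .072.
\end{aligned}
\tag{7.33}
\]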
(Seventeen out of the fifty firms received a training grant, and the average scrap rate is 3.47 across
all firms.) The point estimate of -.052 on grant means that, for given sales and employ, firms receiving
a grant have scrap rates about 5.2% lower than firms without grants. This is the direction of the
expected effect if the training grants are effective, but the t statistic is very small. Thus, from this
cross-sectional analysis, we must conclude that the grants had no effect on firm productivity. We will
return to this example in Chapter 9 and show how adding information from a prior year leads to a 
much different conclusion. 
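
To see how small the t statistic on grant is, the relevant quantities can be computed from the estimate and standard error in (7.33). This is a sketch with illustrative variable names; the exact percentage change uses the standard conversion 100[exp(b) - 1] for a dummy variable in a log-level equation and is included only for comparison with the 5.2% approximation.

```python
import math

coef_grant, se_grant = -0.052, 0.431   # estimate and standard error from (7.33)

t_stat = coef_grant / se_grant                    # roughly -0.12
approx_pct = 100 * coef_grant                     # approximate percentage difference
exact_pct = 100 * (math.exp(coef_grant) - 1)      # exact percentage difference

print(t_stat, approx_pct, exact_pct)
```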

Even in cases where the policy analysis does not involve assigning units to a control group and 
a treatment group, we must be careful to include factors that might be systematically related to the 
binary independent variable of interest. A good example of this is testing for racial discrimination. 
Race is something that is not determined by an individual or by government administrators. In fact, 
race would appear to be the perfect example of an exogenous explanatory variable, given that it is 
determined at birth. However, for historical reasons, race is often related to other relevant factors: 
there are systematic differences in backgrounds across race, and these differences can be important in 
testing for current discrimination. 

As an example, consider testing for discrimination in loan approvals. If we can collect data on, 
say, individual mortgage applications, then we can define the dummy dependent variable approved 
as equal to one if a mortgage application was approved, and zero otherwise. A systematic difference 
in approval rates across races is an indication of discrimination. However, because approval depends 
on many other factors, including income, wealth, credit ratings, and a general ability to pay back the 
loan, we must control for them if there are systematic differences in these factors across race. A linear 
probability model to test for discrimination might look like the following: 


\[
\mathit{approved} = \beta_0 + \beta_1 \mathit{nonwhite} + \beta_2 \mathit{income} + \beta_3 \mathit{wealth} + \beta_4 \mathit{credrate} + \text{other factors}.
\]


Discrimination against nonwhites is indicated by a rejection of \(H_0\): \(\beta_1 = 0\) in favor of \(H_1\): \(\beta_1 < 0\),
because \(\beta_1\) is the amount by which the probability of a nonwhite getting an approval differs from the
probability of a white getting an approval, given the same levels of other variables in the equation. If 
income, wealth, and so on are systematically different across races, then it is important to control for 
these factors in a multiple regression analysis. 


7-6a Program Evaluation and Unrestricted Regression Adjustment 


In Section 3-7e, in the context of potential outcomes, we derived an equation that can be used to test 
the effectiveness of a policy intervention or a program. Letting w again be the binary policy indicator 
and \(x_1, x_2, \ldots, x_k\) the control variables, we obtained the following population regression function:


\[
E(y|w, x) = \alpha + \tau w + x\gamma = \alpha + \tau w + \gamma_1 x_1 + \cdots + \gamma_k x_k, \tag{7.34}
\]


where \(y = (1 - w)y(0) + wy(1)\) is the observed outcome and \([y(0), y(1)]\) are the potential or
counterfactual outcomes.

The reason for including the \(x_j\) in (7.34) is to account for the possibility that program
participation is not randomly assigned. The problem of participation decisions differing systematically
by individual characteristics is often referred to as the self-selection problem, with “self” being
used broadly. For example, children eligible for programs such as Head Start participate largely
based on parental decisions. Because family background and structure play a role in Head Start 
participation decisions, and they also tend to predict child outcomes, we should control for these 
socioeconomic factors when examining the effects of Head Start [see, for example, Currie and 
Thomas (1995)]. 

In the context of causal inference, the assumption that we have sufficient explanatory vari- 
ables so that, conditional on those variables, program participation is as good as random, is the 
unconfoundedness or ignorability assumption introduced in Section 3-7e. When we are mainly
interested in estimating the effect of a program or intervention, as indicated by w, the explanatory variables
\(x_1, x_2, \ldots, x_k\) are often called covariates. These are factors that possibly vary with participation
decisions as well as with potential outcomes.

The self-selection problem is not restricted to decisions to participate in school or government 
programs. It is rampant when studying the economic and societal effects of certain behaviors. For 
example, individuals choose to use illegal drugs or to drink alcohol. If we want to examine the effects 
of such behaviors on unemployment status, earnings, or criminal behavior, we should be concerned 
that drug usage might be correlated with factors affecting potential labor or criminal outcomes. 
Without accounting for systematic differences between those who use drugs and those who do not, we 
are unlikely to obtain a convincing causal estimate of drug usage. 

Self-selection also can be an issue when studying more aggregate units. Cities and states choose 
whether to implement certain gun control laws, and it is likely that this decision is systematically 
related to other factors that affect violent crime [see, for example, Kleck and Patterson (1993)]. 
Hospitals choose to be for profit or nonprofit, and this decision may be related to hospital characteris- 
tics that affect patient health outcomes. 

Most program evaluations are still based on observational (or nonexperimental) data, and so esti- 
mating the simple equation 


\[
y = \alpha + \tau w + u \tag{7.35}
\]


by OLS is unlikely to produce an unbiased or consistent estimate of the causal effect. By including 
suitable covariates, estimation of (7.34) is likely to be more convincing. In the context of program 
evaluation, using the regression 


\[
y_i \text{ on } w_i, x_{i1}, x_{i2}, \ldots, x_{ik}, \quad i = 1, \ldots, n \tag{7.36}
\]


is a version of what is called regression adjustment, and \(\hat{\tau}\), the coefficient on \(w_i\), is the regression
adjusted estimator. The idea is that we have used multiple regression with covariates \(x_1, x_2, \ldots, x_k\) to
adjust for differences across units in estimating the causal effect.

Recall from Section 3-7e that, in addition to unconfoundedness, equation (7.34) was obtained
under the strong assumption of a constant treatment effect: \(\tau = y_i(1) - y_i(0)\) for all i. We are now in a
position to relax this assumption. We still maintain the unconfoundedness, or conditional
independence, assumption, which we reproduce here for convenience:


\[
w \text{ is independent of } [y(0), y(1)] \text{ conditional on } x = (x_1, \ldots, x_k). \tag{7.37}
\]

We also assume the conditional means are linear, but now we allow completely separate equations for 
y(0) and y(1). Written in terms of errors u(0) and u(1), 

\[
y(0) = \psi_0 + (x - \eta)\gamma_0 + u(0) = \psi_0 + \gamma_{0,1}(x_1 - \eta_1) + \cdots + \gamma_{0,k}(x_k - \eta_k) + u(0) \tag{7.38}
\]

\[
y(1) = \psi_1 + (x - \eta)\gamma_1 + u(1) = \psi_1 + \gamma_{1,1}(x_1 - \eta_1) + \cdots + \gamma_{1,k}(x_k - \eta_k) + u(1) \tag{7.39}
\]

where \(\eta_j = E(x_j)\) is the population mean of \(x_j\), \(\psi_0 = E[y(0)]\), and \(\psi_1 = E[y(1)]\). The covariates \(x_j\)
have been centered about their means so that the intercepts, \(\psi_0\) and \(\psi_1\), are the expected values of the
two potential outcomes. Equations (7.38) and (7.39) allow the treatment effect for unit i, \(y_i(1) - y_i(0)\),
to depend on observables \(x_i\) and the unobservables. For unit i, the treatment effect is


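\[
te_i = y_i(1) - y_i(0) = (\psi_1 - \psi_0) + (x_i - \eta)(\gamma_1 - \gamma_0) + [u_i(1) - u_i(0)].
\]

The average treatment effect, which we call \(\tau\) in this section, is \(\tau = \psi_1 - \psi_0\) because

\[
\begin{aligned}
E(te_i) &= (\psi_1 - \psi_0) + E\{(x_i - \eta)(\gamma_1 - \gamma_0) + [u_i(1) - u_i(0)]\} \\
&= \tau + 0 \cdot (\gamma_1 - \gamma_0) + 0 = \tau,
\end{aligned}
\]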
where \((x_i - \eta)\) has a zero mean by construction and \(u_i(0)\), \(u_i(1)\) have zero means because they are the
errors obtained from conditional expectations. The observed outcome
\(y_i = y_i(0) + w_i[y_i(1) - y_i(0)]\) can be written as

\[
y_i = \psi_0 + \tau w_i + (x_i - \eta)\gamma_0 + w_i(x_i - \eta)\delta + u_i(0) + w_i[u_i(1) - u_i(0)], \tag{7.40}
\]

where \(\delta = (\gamma_1 - \gamma_0)\). If we define

\[
u_i = u_i(0) + w_i[u_i(1) - u_i(0)],
\]

then unconfoundedness implies

\[
\begin{aligned}
E(u_i|w_i, x_i) &= E[u_i(0)|w_i, x_i] + w_i E\{[u_i(1) - u_i(0)]|w_i, x_i\} \\
&= E[u_i(0)|x_i] + w_i E\{[u_i(1) - u_i(0)]|x_i\} = 0.
\end{aligned}
\tag{7.41}
\]


Equations (7.40) and (7.41) suggest that we run a regression that includes a full set of interactions
between \(w_i\) and demeaned controls. To implement this method, we also need to replace the unknown
population means, \(\eta_j\), with the sample averages (across the entire sample of n observations), \(\bar{x}_j\). This
leads to the regression


\[
y_i \text{ on } w_i, x_{i1}, \ldots, x_{ik}, w_i \cdot (x_{i1} - \bar{x}_1), \ldots, w_i \cdot (x_{ik} - \bar{x}_k) \tag{7.42}
\]


using all n observations. The coefficient on \(w_i\), \(\hat{\tau}\), is the estimated average causal effect, or average treatment effect.
We can determine how the treatment effect varies with the \(x_j\) by multiplying \((x_j - \bar{x}_j)\) by the coefficient
on the interaction term, \(\hat{\delta}_j\). Note that we do not have to demean the \(x_j\) when they appear by themselves,
as failing to do so only changes the intercept in the regression. But it is critical to demean the \(x_j\) before
constructing the interactions in order to obtain the average treatment effect as the coefficient on \(w_i\).

The estimate of \(\tau\) from (7.42) will generally differ from that from (7.36), as (7.36) omits the k
interaction terms. In the literature, the phrase “regression adjustment” often refers to the more flexible
regression in (7.42). For emphasis, one can use the terms restricted regression adjustment (RRA) (and
use the notation \(\hat{\tau}_{rra}\)) and unrestricted regression adjustment (URA) (using \(\hat{\tau}_{ura}\)) for (7.36) and (7.42),
respectively.
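
The difference between the two estimators can be seen in a small simulation. The sketch below uses hypothetical data in which the treatment effect varies with the first covariate and participation depends on that same covariate; the data-generating process, the true average treatment effect of 2, and the helper `ols` are all assumptions made for illustration.

```python
import numpy as np

def ols(y, X):
    """OLS coefficient vector from regressing y on the columns of X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

rng = np.random.default_rng(42)
n = 5000
x = rng.normal(loc=1.0, scale=1.0, size=(n, 2))       # hypothetical covariates
# Participation depends on x1 (self-selection on observables).
w = (x[:, 0] + rng.normal(size=n) > 1.0).astype(float)

y0 = 1.0 + x @ np.array([0.5, -0.3]) + rng.normal(size=n)   # potential outcome y(0)
y1 = y0 + 2.0 + 1.0 * (x[:, 0] - 1.0)                       # y(1); ATE = 2 since E(x1) = 1
y = (1 - w) * y0 + w * y1                                   # observed outcome

ones = np.ones((n, 1))
x_dm = x - x.mean(axis=0)                      # demean using full-sample averages
X_rra = np.hstack([ones, w[:, None], x])       # regression (7.36)
X_ura = np.hstack([X_rra, w[:, None] * x_dm])  # regression (7.42)

tau_rra = float(ols(y, X_rra)[1])   # coefficient on w, restricted version
tau_ura = float(ols(y, X_ura)[1])   # coefficient on w, unrestricted version
print(tau_rra, tau_ura)
```

In this design the unrestricted estimate should be close to the true average treatment effect, while the restricted estimate need not be, because (7.36) forces the effect to be the same for every unit.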

It turns out that the estimate \(\hat{\tau}\) from (7.42) can be obtained from two separate regressions, just as
when computing the Chow statistic from Section 7-4c. Working through the details is informative, as
it emphasizes the counterfactual nature of the unrestricted regression adjustment. First, we run
regressions separately for the control and treatment groups. For the control group, we use the \(n_0\)
observations with \(w_i = 0\) and run the regression


rra ura. 


Yi OD Xis Xiz - + + > Xik 
and obtain the interercept @ and k slope estimates ĵọ1, Yor, - - - » Pox- We do the same using the n, 
observations with w, = 1, and obtain the intercept â; and the slopes 7; 1; fiz- - -391x 


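Now here is where we use counterfactual reasoning: for every unit $i$ in the sample, we predict
$y_i(0)$ and $y_i(1)$ regardless of whether the unit was in the control or treatment group. Define the
predicted values as


$$\hat{y}_i^{(0)} = \hat{\alpha}_0 + \hat{\gamma}_{01}x_{i1} + \hat{\gamma}_{02}x_{i2} + \cdots + \hat{\gamma}_{0k}x_{ik}$$

$$\hat{y}_i^{(1)} = \hat{\alpha}_1 + \hat{\gamma}_{11}x_{i1} + \hat{\gamma}_{12}x_{i2} + \cdots + \hat{\gamma}_{1k}x_{ik}$$

for all $i$. In other words, we plug the explanatory variables for unit $i$ into both regression functions to
predict the outcomes in the two states of the world: the control state and the treated state. It is then
natural to estimate the average treatment effect as


$$n^{-1}\sum_{i=1}^{n}\left[\hat{y}_i^{(1)} - \hat{y}_i^{(0)}\right] = (\hat{\alpha}_1 - \hat{\alpha}_0) + (\hat{\gamma}_{11} - \hat{\gamma}_{01})\bar{x}_1 + \cdots + (\hat{\gamma}_{1k} - \hat{\gamma}_{0k})\bar{x}_k. \qquad [7.43]$$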
With some algebra, one can show that (7.43) produces the $\hat{\tau}$ from regression (7.42). Thus, two
seemingly different ways of using multiple regression to adjust for differences in units lead to the same
estimate of the ATE.
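The equivalence can be verified numerically. The following is a minimal sketch on simulated data (the sample size, coefficients, and helper name `ols` are illustrative assumptions), comparing the two-regression imputation estimate, the closed form on the right-hand side of (7.43), and the coefficient on $w$ from the pooled regression with demeaned interactions.

```python
import numpy as np

def ols(Z, y):
    """OLS coefficients from regressing y on the columns of Z."""
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return coef

# Hypothetical data with three controls and a treatment effect that varies with x1.
rng = np.random.default_rng(1)
n = 500
X = rng.normal(size=(n, 3))
w = (rng.uniform(size=n) < 0.4).astype(float)
y = 1.0 + 2.0 * w + X @ np.array([0.5, -1.0, 0.2]) + w * X[:, 0] + rng.normal(size=n)

ones = np.ones(n)
Z = np.column_stack([ones, X])

# Step 1: separate regressions for the control (w = 0) and treated (w = 1) groups.
b0 = ols(Z[w == 0], y[w == 0])
b1 = ols(Z[w == 1], y[w == 1])

# Step 2: predict both potential outcomes for every unit, treated or not.
yhat0 = Z @ b0
yhat1 = Z @ b1

# Step 3: average the difference in predictions over all n units.
tau_two_step = np.mean(yhat1 - yhat0)

# Closed form: intercept difference plus slope differences times sample means.
xbar = X.mean(axis=0)
tau_closed = (b1[0] - b0[0]) + (b1[1:] - b0[1:]) @ xbar

# Pooled regression with demeaned interactions; the ATE is the coefficient on w.
Z_pooled = np.column_stack([ones, w, X, w[:, None] * (X - xbar)])
tau_pooled = ols(Z_pooled, y)[1]

print(tau_two_step, tau_closed, tau_pooled)
```

The three numbers agree up to floating-point error because the pooled regression with a full set of interactions spans the same column space as the two group-specific regressions.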

Most regression packages are designed to make calculation of (7.43) easy, because they compute
a predicted value for all observations that have information on the $\{x_{ij}: j = 1, \ldots, k\}$ whether or not
unit $i$ was used in the estimation. However, obtaining a proper standard error when computing (7.43)
“by hand” can be tricky, although some econometrics software has the calculation built in. Regression
(7.42) always produces a valid standard error of $\hat{\tau}$ but requires obtaining the $k$ interaction terms after
demeaning each $x_j$. Incidentally, we can obtain the Chow test that allows different intercepts by
testing whether the $k$ interaction terms are jointly significant using an $F$ statistic. If we fail to reject, we
could return to the regression (7.36) that imposes common slopes.
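The joint test of the interaction terms can be sketched with the usual sum-of-squared-residuals form of the $F$ statistic. The data and the helper name `ssr` below are illustrative assumptions; the restricted model has common slopes and the unrestricted model adds the $k$ demeaned interactions.

```python
import numpy as np

def ssr(Z, y):
    """Sum of squared OLS residuals from regressing y on the columns of Z."""
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ coef
    return float(resid @ resid)

# Hypothetical data in which the treatment effect genuinely varies with x1.
rng = np.random.default_rng(2)
n, k = 400, 2
X = rng.normal(size=(n, k))
w = (rng.uniform(size=n) < 0.5).astype(float)
y = 1.0 + w + X @ np.array([1.0, 1.0]) + 1.0 * w * X[:, 0] + rng.normal(size=n)

ones = np.ones(n)
Xd = X - X.mean(axis=0)
Z_r = np.column_stack([ones, w, X])                     # common slopes (restricted)
Z_ur = np.column_stack([ones, w, X, w[:, None] * Xd])   # full interactions (unrestricted)

ssr_r, ssr_ur = ssr(Z_r, y), ssr(Z_ur, y)
q = k                        # number of restrictions: the k interaction terms
df_ur = n - Z_ur.shape[1]    # n - 2(k + 1) degrees of freedom
F = ((ssr_r - ssr_ur) / q) / (ssr_ur / df_ur)
print(F, q, df_ur)
```

The statistic would be compared with a critical value from the $F_{q,\,n-2(k+1)}$ distribution; a large value, as in this constructed example, is evidence against common slopes. Like the usual $F$ test, this form relies on homoskedasticity.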


Example 7.13 Evaluating a Job Training Program Using Unrestricted Regression Adjustment


The data in JTRAIN98 were used in Examples 3.11 and 4.11 to estimate the effects of a job
training program. The variable we would like to explain, $y = earn98$, is labor market earnings in 1998,
the year following the job training program (which took place in 1997). The earnings variable is
measured in thousands of dollars. The variable $w = train$ is the binary participation (or “treatment”)
indicator. We use the same controls as in Example 4.11 (earn96, educ, age, and married), but now
we use unrestricted regression adjustment. For comparison purposes, the simple difference-in-means
estimate is $\hat{\tau}_{diffmeans} = -2.05$ (se = 0.48) and the restricted regression adjusted estimate, reported in
equation (4.52), is $\hat{\tau}_{rra} = 2.44$ (se = 0.44). The estimated equation with full interactions is


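$$\begin{aligned}
\widehat{earn98} = {} & \underset{(1.39)}{5.08} + \underset{(0.53)}{3.11}\,train + \underset{(.020)}{.353}\,earn96 + \underset{(.078)}{.378}\,educ - \underset{(.023)}{.196}\,age + \underset{(0.55)}{2.76}\,married \\
& + \underset{(0.054)}{.133}\,train \cdot (earn96 - \overline{earn96}) - \underset{(.137)}{.035}\,train \cdot (educ - \overline{educ}) \\
& + \underset{(.041)}{.058}\,train \cdot (age - \overline{age}) - \underset{(.883)}{.993}\,train \cdot (married - \overline{married}) \qquad [7.44]
\end{aligned}$$

$$n = 1{,}130,\; R^2 = 0.409.$$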


The estimated average treatment effect is the coefficient on train, $\hat{\tau}_{ura} = 3.11$ (se = 0.53), which is
very statistically significant with $t_{train} > 5.8$. It is also notably higher than the restricted RA estimate,
although the $F$ statistic for joint significance of the interaction terms gives $p$-value $\approx 0.113$, and so the
interaction terms are not jointly significant at the 10% level.
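As a quick arithmetic check on the reported significance, using only the coefficient and standard error in (7.44) and the restricted estimate quoted before it:

```latex
t_{train} = \frac{3.11}{0.53} \approx 5.87,
\qquad
\hat{\tau}_{ura} - \hat{\tau}_{rra} = 3.11 - 2.44 = 0.67
```

Because earnings are measured in thousands of dollars, the gap of 0.67 between the two adjusted estimates corresponds to roughly $670 in 1998 earnings.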


Example 7.13 warrants some final comments. First, to obtain the average treatment effect as
the coefficient on train, all explanatory variables, including the dummy variable married, must be
demeaned before creating the interaction term with train. Using $train \cdot married$ in place of the final
interaction forces the coefficient on train to be the average treatment effect for unmarried men,
where the averages are across earn96, educ, and age. The estimate turns out to be 3.79 (se = 0.81).
The coefficient on $train \cdot married$ would be unchanged from that in (7.44), $-.993$ (se = 0.883), and
is still interpreted as the difference in the ATE between married and unmarried men.
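The claim about the coefficient on train can be checked by expanding the demeaned interaction. The lines below are a short derivation, not something estimated from JTRAIN98; $\bar{m}$ is shorthand introduced here for the sample mean of married, which the text does not report.

```latex
\begin{aligned}
\hat{\tau}_{ura}\,train + \hat{\delta}_{married}\,train \cdot (married - \bar{m})
  &= (\hat{\tau}_{ura} - \hat{\delta}_{married}\,\bar{m})\,train
     + \hat{\delta}_{married}\,train \cdot married \\
\text{coefficient on } train
  &= 3.11 - (-.993)\,\bar{m} = 3.11 + .993\,\bar{m}
\end{aligned}
```

Setting married to zero shows that this coefficient is the estimated effect for unmarried men at the average values of the other controls, while the interaction coefficient is untouched. Matching it to the reported 3.79 would imply $\bar{m} \approx 0.68$, a back-of-the-envelope figure that is subject to rounding in the reported coefficients.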

In using regression adjustment to estimate the effects of something like a job training program,
there is always the possibility that our control variables, $x_1, x_2, \ldots, x_k$, are not sufficient for fully
overcoming self-selection into participation. One must be on guard at all times unless $w$ is known to
have been randomized. With observational data, the possibility of finding a spurious effect (in either
direction) is often quite high, even with a rich set of $x_j$. A good example of this is contained in
Currie and Cole (1993). These authors examine the effect of AFDC (Aid to Families with Dependent 
Children) participation on the birth weight of a child. Even after controlling for a variety of family 
and background characteristics, the authors obtain OLS estimates that imply participation in AFDC 
lowers birth weight. As the authors point out, it is hard to believe that AFDC participation itself causes 
lower birth weight. [See Currie (1995) for additional examples.] 

Using a different econometric method that we discuss in Chapter 15, Currie and Cole find evi- 
dence for either no effect or a positive effect of AFDC participation on birth weight. When the self- 
selection problem causes standard multiple regression analysis to be biased due to a lack of sufficient 
control variables, the more advanced methods covered in Chapters 13, 14, and 15 can be used instead.


7-7 Interpreting Regression Results with Discrete Dependent Variables


A binary response is the most extreme form of a discrete random variable: it takes on only two val- 
ues, zero and one. As we discussed in Section 7-5, the parameters in a linear probability model can 
be interpreted as measuring the change in the probability that y = 1 due to a one-unit increase in an 
explanatory variable. We also discussed that, because y is a zero-one outcome, P(y = 1) = E(y), and 
this equality continues to hold when we condition on explanatory variables. 

Other discrete dependent variables arise in practice, and we have already seen some examples, 
such as the number of times someone is arrested in a given year (Example 3.5). Studies on factors 
affecting fertility often use the number of living children as the dependent variable in a regression 
analysis. As with number of arrests, the number of living children takes on a small set of integer val- 
ues, and zero is a common value. The data in FERTIL2, which contains information on a large sample 
of women in Botswana, is one such example. Often demographers are interested in the effects of
education on fertility, with special attention to trying to determine whether education has a causal effect
on fertility. Such examples raise a question about how one interprets regression coefficients: after all, 
one cannot have a fraction of a child. 

To illustrate the issues, the regression below uses the data in FERTIL2: 


$$\widehat{children} = \underset{(.094)}{-1.997} + \underset{(.003)}{.175}\,age - \underset{(.006)}{.090}\,educ \qquad [7.45]$$

$$n = 4{,}361,\; R^2 = .560.$$


At this time, we ignore the issue of whether this regression adequately controls for all factors that 
affect fertility. Instead we focus on interpreting the regression coefficients. 

Consider the main coefficient of interest, $\hat{\beta}_{educ} = -.090$. If we take this estimate literally, it says
that each additional year of education reduces the estimated number of children by .090, something
obviously impossible for any particular woman. A similar problem arises when trying to interpret
$\hat{\beta}_{age} = .175$. How can we make sense of these coefficients?

To interpret regression results generally, even in cases where $y$ is discrete and takes on a small
number of values, it is useful to remember the interpretation of OLS as estimating the effects of the $x_j$
on the expected (or average) value of $y$. Generally, under Assumptions MLR.1 and MLR.4,


$$E(y|x_1, x_2, \ldots, x_k) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k. \qquad [7.46]$$


Therefore, $\beta_j$ is the effect of a ceteris paribus increase of $x_j$ on the expected value of $y$. As we discussed
in Section 6-4, for a given set of $x_j$ values we interpret the predicted value, $\hat{\beta}_0 + \hat{\beta}_1 x_1 + \cdots + \hat{\beta}_k x_k$, as
an estimate of $E(y|x_1, x_2, \ldots, x_k)$. Therefore, $\hat{\beta}_j$ is our estimate of how the average of $y$ changes when
$\Delta x_j = 1$ (keeping other factors fixed).
Seen in this light, we can now provide meaning to regression results as in equation (7.45). The
coefficient $\hat{\beta}_{educ} = -.090$ means that we estimate that average fertility falls by .09 children given
one more year of education. A nice way to summarize this interpretation is that if each woman in 
a group of 100 obtains another year of education, we estimate there will be nine fewer children 
among them. 
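The "100 women" reading follows directly from the fitted equation (7.45). A minimal sketch, where the function name and the particular age and education values are hypothetical choices for illustration:

```python
def predicted_children(age, educ):
    """Fitted value from equation (7.45): estimated average number of children."""
    return -1.997 + 0.175 * age - 0.090 * educ

# Hypothetical comparison: women aged 30 with 8 versus 9 years of education.
base = predicted_children(age=30, educ=8)
more = predicted_children(age=30, educ=9)

change_per_woman = more - base            # change in the estimated average
change_per_100 = 100 * change_per_woman   # scaled to a group of 100 women
print(round(base, 3), round(change_per_woman, 3), round(change_per_100, 1))
```

The fitted values are averages, so a non-integer prediction such as 2.533 children is an estimate of mean fertility for women with those characteristics, not a prediction for any single woman.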

Adding dummy variables to regressions when y is itself discrete causes no problems when we 
interpret the estimated effect in terms of average values. Using the data in FERTIL2 we get 


$$\widehat{children} = \underset{(.095)}{-2.071} + \underset{(.003)}{.177}\,age - \underset{(.006)}{.079}\,educ - \underset{(.068)}{.362}\,electric \qquad [7.47]$$

$$n = 4{,}358,\; R^2 = .562,$$


where electric is a dummy variable equal to one if the woman lives in a home with electricity. Of 
course it cannot be true that a particular woman who has electricity has .362 fewer children than an
otherwise comparable woman who does not. But we can say that when comparing 100 women with 
electricity to 100 women without—at the same age and level of education—we estimate the former 
group to have about 36 fewer children. 

Incidentally, when $y$ is discrete the linear model does not always provide the best estimates of
partial effects on $E(y|x_1, x_2, \ldots, x_k)$. Chapter 17 contains more advanced models and estimation methods
that tend to fit the data better when the range of y is limited in some substantive way. Nevertheless, a 
linear model estimated by OLS often provides a good approximation to the true partial effects, at least 
on average. 


Summary 


In this chapter, we have learned how to use qualitative information in regression analysis. In the sim- 
plest case, a dummy variable is defined to distinguish between two groups, and the coefficient estimate 
on the dummy variable estimates the ceteris paribus difference between the two groups. Allowing for 
more than two groups is accomplished by defining a set of dummy variables: if there are g groups, then 
g — 1 dummy variables are included in the model. All estimates on the dummy variables are inter- 
preted relative to the base or benchmark group (the group for which no dummy variable is included in 
the model). 

Dummy variables are also useful for incorporating ordinal information, such as a credit or a beauty 
rating, in regression models. We simply define a set of dummy variables representing different outcomes of 
the ordinal variable, allowing one of the categories to be the base group. 

Dummy variables can be interacted with quantitative variables to allow slope differences across 
different groups. In the extreme case, we can allow each group to have its own slope on every vari- 
able, as well as its own intercept. The Chow test can be used to detect whether there are any dif- 
ferences across groups. In many cases, it is more interesting to test whether, after allowing for an 
intercept difference, the slopes for two different groups are the same. A standard F test can be used 
for this purpose in an unrestricted model that includes interactions between the group dummy and all 
variables. 

The linear probability model, which is simply estimated by OLS, allows us to explain a binary re- 
sponse using regression analysis. The OLS estimates are now interpreted as changes in the probability of 
“success” (y = 1), given a one-unit increase in the corresponding explanatory variable. The LPM does 
have some drawbacks: it can produce predicted probabilities that are less than zero or greater than one, 
it implies a constant marginal effect of each explanatory variable that appears in its original form, and it 
contains heteroskedasticity. The first two problems are often not serious when we are obtaining estimates 
of the partial effects of the explanatory variables for the middle ranges of the data. Heteroskedasticity does
invalidate the usual OLS standard errors and test statistics, but, as we will see in the next chapter, this is 
easily fixed in large enough samples. 

Section 7.6 provides a discussion of how binary variables are used to evaluate policies and programs. 
As in all regression analysis, we must remember that program participation, or some other binary regressor 
with policy implications, might be correlated with unobserved factors that affect the dependent variable, 
resulting in the usual omitted variables bias. 

We ended this chapter with a general discussion of how to interpret regression equations when the 
dependent variable is discrete. The key is to remember that the coefficients can be interpreted as the effects 
on the expected value of the dependent variable. 


Key Terms 


Base Group
Benchmark Group
Binary Variable
Chow Statistic
Control Group
Covariates
Difference in Slopes
Dummy Variable Trap
Dummy Variables
Experimental Group
Interaction Term
Intercept Shift
Linear Probability Model (LPM)
Ordinal Variable
Percent Correctly Predicted
Policy Analysis
Program Evaluation
Regression Adjustment
Response Probability
Self-Selection Problem
Treatment Group
Uncentered R-Squared
Zero-One Variable


Problems 


1 Using the data in SLEEP75 (see also Problem 3 in Chapter 3), we obtain the estimated equation 


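$$\begin{aligned}
\widehat{sleep} = {} & \underset{(235.11)}{3{,}840.83} - \underset{(.018)}{.163}\,totwrk - \underset{(5.86)}{11.71}\,educ - \underset{(11.21)}{8.70}\,age \\
& + \underset{(.134)}{.128}\,age^2 + \underset{(34.33)}{87.75}\,male
\end{aligned}$$

$$n = 706,\; R^2 = .123,\; \bar{R}^2 = .117.$$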


The variable sleep is total minutes per week spent sleeping at night, totwrk is total weekly minutes 
spent working, educ and age are measured in years, and male is a gender dummy. 
(i) All other factors being equal, is there evidence that men sleep more than women? How strong
is the evidence? 
(ii) Is there a statistically significant tradeoff between working and sleeping? What is the estimated 
tradeoff? 
(iii) What other regression do you need to run to test the null hypothesis that, holding other factors 
fixed, age has no effect on sleeping? 


2 The following equations were estimated using the data in BWGHT: 


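$$\begin{aligned}
\widehat{\log(bwght)} = {} & \underset{(.22)}{4.66} - \underset{(.0009)}{.0044}\,cigs + \underset{(.0059)}{.0093}\,\log(faminc) + \underset{(.006)}{.016}\,parity \\
& + \underset{(.010)}{.027}\,male + \underset{(.013)}{.055}\,white
\end{aligned}$$

$$n = 1{,}388,\; R^2 = .0472$$

and

$$\begin{aligned}
\widehat{\log(bwght)} = {} & \underset{(.38)}{4.65} - \underset{(.0010)}{.0052}\,cigs + \underset{(.0085)}{.0110}\,\log(faminc) + \underset{(.006)}{.017}\,parity \\
& + \underset{(.011)}{.034}\,male + \underset{(.015)}{.045}\,white - \underset{(.0030)}{.0030}\,motheduc + \underset{(.0026)}{.0032}\,fatheduc
\end{aligned}$$

$$n = 1{,}191,\; R^2 = .0493.$$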


The variables are defined as in Example 4.9, but we have added a dummy variable for whether the
child is male and a dummy variable indicating whether the child is classified as white.

(i) In the first equation, interpret the coefficient on the variable cigs. In particular, what is the effect 
on birth weight from smoking 10 more cigarettes per day? 

(ii) How much more is a white child predicted to weigh than a nonwhite child, holding the other 
factors in the first equation fixed? Is the difference statistically significant? 

(iii) Comment on the estimated effect and statistical significance of motheduc. 

(iv) From the given information, why are you unable to compute the F statistic for joint significance 
of motheduc and fatheduc? What would you have to do to compute the F statistic? 


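3 Using the data in GPA2, the following equation was estimated:


$$\begin{aligned}
\widehat{sat} = {} & \underset{(6.29)}{1{,}028.10} + \underset{(3.83)}{19.30}\,hsize - \underset{(.53)}{2.19}\,hsize^2 - \underset{(4.29)}{45.09}\,female \\
& - \underset{(12.71)}{169.81}\,black + \underset{(18.15)}{62.31}\,female \cdot black
\end{aligned}$$

$$n = 4{,}137,\; R^2 = .0858.$$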


The variable sat is the combined SAT score; hsize is size of the student’s high school graduating class, 
in hundreds; female is a gender dummy variable; and black is a race dummy variable equal to one for 
blacks, and zero otherwise. 

(i) Is there strong evidence that $hsize^2$ should be included in the model? From this equation, what is
the optimal high school size? 

(ii) Holding hsize fixed, what is the estimated difference in SAT score between nonblack females 
and nonblack males? How statistically significant is this estimated difference? 

(iii) What is the estimated difference in SAT score between nonblack males and black males? Test 
the null hypothesis that there is no difference between their scores, against the alternative that 
there is a difference. 

(iv) What is the estimated difference in SAT score between black females and nonblack females? 
What would you need to do to test whether the difference is statistically significant? 


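4 An equation explaining chief executive officer salary is


$$\begin{aligned}
\widehat{\log(salary)} = {} & \underset{(.30)}{4.59} + \underset{(.032)}{.257}\,\log(sales) + \underset{(.004)}{.011}\,roe + \underset{(.089)}{.158}\,finance \\
& + \underset{(.085)}{.181}\,consprod - \underset{(.099)}{.283}\,utility
\end{aligned}$$

$$n = 209,\; R^2 = .357.$$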


The data used are in CEOSAL1, where finance, consprod, and utility are binary variables indicating 
the financial, consumer products, and utilities industries. The omitted industry is transportation. 
(i) Compute the approximate percentage difference in estimated salary between the utility and
transportation industries, holding sales and roe fixed. Is the difference statistically significant at 
the 1% level? 

(ii) Use equation (7.10) to obtain the exact percentage difference in estimated salary between the 
utility and transportation industries and compare this with the answer obtained in part (i). 

(iii) What is the approximate percentage difference in estimated salary between the consumer 
products and finance industries? Write an equation that would allow you to test whether the 
difference is statistically significant. 


5 In Example 7.2, let noPC be a dummy variable equal to one if the student does not own a PC, and zero
otherwise.

(i) If noPC is used in place of PC in equation (7.6), what happens to the intercept in the estimated
equation? What will be the coefficient on noPC? (Hint: Write $PC = 1 - noPC$ and plug this
into the equation $\widehat{colGPA} = \hat{\beta}_0 + \hat{\delta}_0 PC + \hat{\beta}_1 hsGPA + \hat{\beta}_2 ACT$.)

(ii) What will happen to the R-squared if noPC is used in place of PC?

(iii) Should PC and noPC both be included as independent variables in the model? Explain.


6 To test the effectiveness of a job training program on the subsequent wages of workers, we specify the
model


$$\log(wage) = \beta_0 + \beta_1 train + \beta_2 educ + \beta_3 exper + u,$$


where train is a binary variable equal to unity if a worker participated in the program. Think of the
error term $u$ as containing unobserved worker ability. If less able workers have a greater chance of
being selected for the program, and you use an OLS analysis, what can you say about the likely bias in
the OLS estimator of $\beta_1$? (Hint: Refer back to Chapter 3.)


7 In the example in equation (7.29), suppose that we define outlf to be one if the woman is out of the
labor force, and zero otherwise.

(i) If we regress outlf on all of the independent variables in equation (7.29), what will happen to the
intercept and slope estimates? (Hint: $inlf = 1 - outlf$. Plug this into the population equation
$inlf = \beta_0 + \beta_1 nwifeinc + \beta_2 educ + \cdots$ and rearrange.)

(ii) What will happen to the standard errors on the intercept and slope estimates? 

(iii) What will happen to the R-squared? 


8 Suppose you collect data from a survey on wages, education, experience, and gender. In addition, you
ask for information about marijuana usage. The original question is: “On how many separate occasions 
last month did you smoke marijuana?” 

(i) Write an equation that would allow you to estimate the effects of marijuana usage on wage, 
while controlling for other factors. You should be able to make statements such as, “Smoking 
marijuana five more times per month is estimated to change wage by x%.” 

(ii) Write a model that would allow you to test whether drug usage has different effects on wages 
for men and women. How would you test that there are no differences in the effects of drug 
usage for men and women? 

(iii) Suppose you think it is better to measure marijuana usage by putting people into one of four 
categories: nonuser, light user (1 to 5 times per month), moderate user (6 to 10 times per 
month), and heavy user (more than 10 times per month). Now, write a model that allows you to 
estimate the effects of marijuana usage on wage. 

(iv) Using the model in part (iii), explain in detail how to test the null hypothesis that marijuana 
usage has no effect on wage. Be very specific and include a careful listing of degrees of 
freedom. 

(v) What are some potential problems with drawing causal inference using the survey data that you 
collected? 
9 Let $d$ be a dummy (binary) variable and let $z$ be a quantitative variable. Consider the model


$$y = \beta_0 + \delta_0 d + \beta_1 z + \delta_1 d \cdot z + u;$$


this is a general version of a model with an interaction between a dummy variable and a quantitative
variable. [An example is in equation (7.17).]

(i) Because it changes nothing important, set the error to zero, $u = 0$. Then, when $d = 0$ we
can write the relationship between $y$ and $z$ as the function $f_0(z) = \beta_0 + \beta_1 z$. Write the same
relationship when $d = 1$, where you should use $f_1(z)$ on the left-hand side to denote the linear
function of $z$.

(ii) Assuming that $\delta_1 \neq 0$ (which means the two lines are not parallel), show that the value of $z^*$
such that $f_0(z^*) = f_1(z^*)$ is $z^* = -\delta_0/\delta_1$. This is the point at which the two lines intersect [as
in Figure 7.2 (b)]. Argue that $z^*$ is positive if and only if $\delta_0$ and $\delta_1$ have opposite signs.

(iii) Using the data in TWOYEAR, the following equation can be estimated:


$$\widehat{\log(wage)} = \underset{(0.011)}{2.289} - \underset{(.015)}{.357}\,female + \underset{(.003)}{.050}\,totcoll + \underset{(.005)}{.030}\,female \cdot totcoll$$

$$n = 6{,}763,\; R^2 = .202,$$


where all coefficients and standard errors have been rounded to three decimal places. Using this
equation, find the value of totcoll such that the predicted values of log(wage) are the same for
men and women.

(iv) Based on the equation in part (iii), can women realistically get enough years of college so that
their earnings catch up to those of men? Explain.


10 For a child $i$ living in a particular school district, let $voucher_i$ be a dummy variable equal to one if a
child is selected to participate in a school voucher program, and let $score_i$ be that child’s score on a
subsequent standardized exam. Suppose that the participation variable, $voucher_i$, is completely randomized
in the sense that it is independent of both observed and unobserved factors that can affect the test score.

(i) If you run a simple regression $score_i$ on $voucher_i$ using a random sample of size $n$, does the OLS
estimator provide an unbiased estimator of the effect of the voucher program?

(ii) Suppose you can collect additional background information, such as family income, family
structure (e.g., whether the child lives with both parents), and parents’ education levels. Do you
need to control for these factors to obtain an unbiased estimator of the effects of the voucher
program? Explain.

(iii) Why should you include the family background variables in the regression? Is there a situation
in which you would not include the background variables?


11 The following equations were estimated using the data in ECONMATH, with standard errors reported
under coefficients. The average class score, measured as a percentage, is about 72.2; exactly 50% of
the students are male; and the average of colgpa (grade point average at the start of the term) is about
2.81.


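$$\widehat{score} = \underset{(2.00)}{32.31} + \underset{(0.70)}{14.32}\,colgpa$$

$$n = 856,\; R^2 = .329,\; \bar{R}^2 = .328.$$


$$\widehat{score} = \underset{(2.04)}{29.66} + \underset{(0.74)}{3.83}\,male + \underset{(0.69)}{14.57}\,colgpa$$

$$n = 856,\; R^2 = .349,\; \bar{R}^2 = .348.$$


$$\widehat{score} = \underset{(2.86)}{30.36} + \underset{(3.96)}{2.47}\,male + \underset{(0.98)}{14.33}\,colgpa + \underset{(1.383)}{0.479}\,male \cdot colgpa$$

$$n = 856,\; R^2 = .349,\; \bar{R}^2 = .347.$$


$$\widehat{score} = \underset{(2.86)}{30.36} + \underset{(0.74)}{3.82}\,male + \underset{(0.98)}{14.33}\,colgpa + \underset{(1.383)}{0.479}\,male \cdot (colgpa - 2.81)$$

$$n = 856,\; R^2 = .349,\; \bar{R}^2 = .347.$$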


(i) Interpret the coefficient on male in the second equation and construct a 95% confidence interval
for $\beta_{male}$. Does the confidence interval exclude zero?

(ii) In the third equation, why is the estimate on male so imprecise? Should we now conclude
that there are no gender differences in score after controlling for colgpa? [Hint: You might want 
to compute an F statistic for the null hypothesis that there is no gender difference in the model 
with the interaction. ] 

(iii) Compared with the third equation, why is the coefficient on male in the last equation so much 
closer to that in the second equation and just as precisely estimated? 


12 Consider Example 7.11, where, prior to computing the interaction between the race/ethnicity of a
player and the city’s racial composition, we center the city composition variables about the sample
averages, $\overline{percblck}$ and $\overline{perchisp}$ (which are, approximately, 16.55 and 10.82, respectively). The resulting
estimated equation is


$$\begin{aligned}
\widehat{\log(salary)} = {} & \underset{(2.18)}{10.23} + \underset{(.0129)}{.0673}\,years + \underset{(.0034)}{.0089}\,gamesyr + \underset{(.00151)}{.00095}\,bavg + \underset{(.0164)}{.0146}\,hrunsyr \\
& + \underset{(.0076)}{.0045}\,rbisyr + \underset{(.0046)}{.0072}\,runsyr + \underset{(.0021)}{.0011}\,fldperc + \underset{(.0029)}{.0075}\,allstar \\
& + \underset{(.0840)}{.0080}\,black + \underset{(.1084)}{.0273}\,hispan + \underset{(.0050)}{.0125}\,black \cdot (percblck - \overline{percblck}) \\
& + \underset{(.0098)}{.0201}\,hispan \cdot (perchisp - \overline{perchisp})
\end{aligned}$$

$$n = 330,\; R^2 = 0.638.$$


(i) Why are the coefficients on black and hispan now so much different than those reported in
equation (7.19)? In particular, how can you interpret these coefficients? 

(ii) What do you make of the fact that neither black nor hispan is statistically significant in the 
above equation? 

(iii) In comparing the above equation to (7.19), has anything else changed? Why or why not?


13 (i) In the context of potential outcomes with a sample of size $n$, let $[y_i(0), y_i(1)]$ denote the pair of
potential outcomes for unit $i$. Define the averages

$$\bar{y}(0) = n^{-1}\sum_{i=1}^{n} y_i(0), \qquad \bar{y}(1) = n^{-1}\sum_{i=1}^{n} y_i(1)$$

and define the sample average treatment effect (SATE) as $SATE = \bar{y}(1) - \bar{y}(0)$. Can you compute
the SATE given the typical program evaluation data set?
(ii) Let $\bar{y}_0$ and $\bar{y}_1$ be the sample averages of the observed $y_i$ for the control and treated groups,
respectively. Show how these differ from $\bar{y}(0)$ and $\bar{y}(1)$.




Computer Exercises 


C1 Use the data in GPA1 for this exercise. 

(i) Add the variables mothcoll and fathcoll to the equation estimated in (7.6) and report the results 
in the usual form. What happens to the estimated effect of PC ownership? Is PC still statistically 
significant? 

(ii) Test for joint significance of mothcoll and fathcoll in the equation from part (i) and be sure to 
report the p-value. 

(iii) Add $hsGPA^2$ to the model from part (i) and decide whether this generalization is needed.


C2 Use the data in WAGE2 for this exercise. 
(i) Estimate the model 


$$\begin{aligned}
\log(wage) = {} & \beta_0 + \beta_1 educ + \beta_2 exper + \beta_3 tenure + \beta_4 married \\
& + \beta_5 black + \beta_6 south + \beta_7 urban + u
\end{aligned}$$


and report the results in the usual form. Holding other factors fixed, what is the approximate 
difference in monthly salary between blacks and nonblacks? Is this difference statistically 
significant? 

(ii) Add the variables $exper^2$ and $tenure^2$ to the equation and show that they are jointly insignificant
at even the 20% level. 

(iii) Extend the original model to allow the return to education to depend on race and test whether 
the return to education does depend on race. 

(iv) Again, start with the original model, but now allow wages to differ across four groups of people: 
married and black, married and nonblack, single and black, and single and nonblack. What is 
the estimated wage differential between married blacks and married nonblacks? 


C3 A model that allows major league baseball player salary to differ by position is 
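$$\begin{aligned}
\log(salary) = {} & \beta_0 + \beta_1 years + \beta_2 gamesyr + \beta_3 bavg + \beta_4 hrunsyr \\
& + \beta_5 rbisyr + \beta_6 runsyr + \beta_7 fldperc + \beta_8 allstar \\
& + \beta_9 frstbase + \beta_{10} scndbase + \beta_{11} thrdbase + \beta_{12} shrtstop \\
& + \beta_{13} catcher + u,
\end{aligned}$$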


where outfield is the base group. 

(i) State the null hypothesis that, controlling for other factors, catchers and outfielders earn, on 
average, the same amount. Test this hypothesis using the data in MLB1 and comment on the 
size of the estimated salary differential. 

(ii) State and test the null hypothesis that there is no difference in average salary across positions, 
once other factors have been controlled for. 

(iii) Are the results from parts (i) and (ii) consistent? If not, explain what is happening. 


C4 Use the data in GPA2 for this exercise. 
(i) Consider the equation


$$\begin{aligned}
colgpa = {} & \beta_0 + \beta_1 hsize + \beta_2 hsize^2 + \beta_3 hsperc + \beta_4 sat \\
& + \beta_5 female + \beta_6 athlete + u,
\end{aligned}$$


where colgpa is cumulative college grade point average; hsize is size of high school graduating 
class, in hundreds; hsperc is academic percentile in graduating class; sat is combined SAT 
score; female is a binary gender variable; and athlete is a binary variable, which is one for 
student athletes. What are your expectations for the coefficients in this equation? Which ones 
are you unsure about? 


C5 


C6 


C7 


c8 
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(ii) Estimate the equation in part (1) and report the results in the usual form. What is the estimated 
GPA differential between athletes and nonathletes? Is it statistically significant? 

(iii) Drop sat from the model and reestimate the equation. Now, what is the estimated effect of being 
an athlete? Discuss why the estimate is different than that obtained in part (ii). 

(iv) In the model from part (i), allow the effect of being an athlete to differ by gender and test the 
null hypothesis that there is no ceteris paribus difference between women athletes and women 
nonathletes. 

(v) Does the effect of sat on colgpa differ by gender? Justify your answer. 


In Problem 2 in Chapter 4, we added the return on the firm’s stock, ros, to a model explaining CEO 
salary; ros turned out to be insignificant. Now, define a dummy variable, rosneg, which is equal to one 
if ros < 0 and equal to zero if ros = 0. Use CEOSALI to estimate the model 


log(salary) = By + B,log(sales) + Boroe + Byrosneg + u. 


Discuss the interpretation and statistical significance of 83. 


Use the data in SLEEP7S for this exercise. The equation of interest is 
sleep = By + B,totwrk + B,educ + B,age + Byage? + Bsyngkid + u. 


(i) Estimate this equation separately for men and women and report the results in the usual form. 
Are there notable differences in the two estimated equations? 

(ii) Compute the Chow test for equality of the parameters in the sleep equation for men and women. 
Use the form of the test that adds male and the interaction terms male-totwrk, ..., male-yngkid 
and uses the full set of observations. What are the relevant df for the test? Should you reject the 
null at the 5% level? 

(iii) Now, allow for a different intercept for males and females and determine whether the interaction 
terms involving male are jointly significant. 

(iv) Given the results from parts (ii) and (iii), what would be your final model? 


Use the data in WAGE] for this exercise. 

(i) Use equation (7.18) to estimate the gender differential when educ = 12.5. 
Compare this with the estimated differential when educ = 0. 

(ii) Run the regression used to obtain (7.18), but with female·(educ − 12.5) replacing female·educ. 
How do you interpret the coefficient on female now? 

(iii) Is the coefficient on female in part (ii) statistically significant? Compare this with (7.18) and 
comment. 


Use the data in LOANAPP for this exercise. The binary variable to be explained is approve, which is 
equal to one if a mortgage loan to an individual was approved. The key explanatory variable is white, a 
dummy variable equal to one if the applicant was white. The other applicants in the data set are black 
and Hispanic. 

To test for discrimination in the mortgage loan market, a linear probability model can be used: 


$$\mathit{approve} = \beta_0 + \beta_1 \mathit{white} + \text{other factors}.$$


(i) If there is discrimination against minorities, and the appropriate factors have been controlled 
for, what is the sign of $\beta_1$? 

(ii) Regress approve on white and report the results in the usual form. Interpret the coefficient on 
white. Is it statistically significant? Is it practically large? 

(iii) As controls, add the variables hrat, obrat, loanprc, unem, male, married, dep, sch, cosign, 
chist, pubrec, mortlat1, mortlat2, and vr. What happens to the coefficient on white? Is there still 
evidence of discrimination against nonwhites? 

(iv) Now, allow the effect of race to interact with the variable measuring other obligations as a 
percentage of income (obrat). Is the interaction term significant? 

(v) Using the model from part (iv), what is the effect of being white on the probability of approval 
when obrat = 32, which is roughly the mean value in the sample? Obtain a 95% confidence 
interval for this effect. 


C9 There has been much interest in whether the presence of 401(k) pension plans, available to many U.S. 
workers, increases net savings. The data set 401KSUBS contains information on net financial assets (nettfa), 
family income (inc), a binary variable for eligibility in a 401(k) plan (e401k), and several other variables. 

(i) What fraction of the families in the sample are eligible for participation in a 401(k) plan? 

(ii) Estimate a linear probability model explaining 401(k) eligibility in terms of income, age, and 
gender. Include income and age in quadratic form, and report the results in the usual form. 

(iii) Would you say that 401(k) eligibility is independent of income and age? What about gender? 
Explain. 

(iv) Obtain the fitted values from the linear probability model estimated in part (ii). Are any fitted 
values negative or greater than one? 

(v) Using the fitted values $\widehat{e401k}_i$ from part (iv), define $\widetilde{e401k}_i = 1$ if $\widehat{e401k}_i \ge .5$ and $\widetilde{e401k}_i = 0$ if 
$\widehat{e401k}_i < .5$. Out of 9,275 families, how many are predicted to be eligible for a 401(k) plan? 

(vi) For the 5,638 families not eligible for a 401(k), what percentage of these are predicted not to 
have a 401(k), using the predictor $\widetilde{e401k}_i$? For the 3,637 families eligible for a 401(k) plan, 
what percentage are predicted to have one? (It is helpful if your econometrics package has a 
“tabulate” command.) 

(vii) The overall percent correctly predicted is about 64.9%. Do you think this is a complete 
description of how well the model does, given your answers in part (vi)? 

(viii) Add the variable pira as an explanatory variable to the linear probability model. Other things 
equal, if a family has someone with an individual retirement account, how much higher is the 
estimated probability that the family is eligible for a 401(k) plan? Is it statistically different 
from zero at the 10% level? 
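
The prediction rule used in parts (v) through (vii) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `percent_correctly_predicted` and the six data points are hypothetical and are not taken from 401KSUBS.

```python
def percent_correctly_predicted(y, fitted, threshold=0.5):
    """Return (overall, pct for y = 0, pct for y = 1) using the rule
    predicted = 1 if fitted >= threshold, and 0 otherwise."""
    predicted = [1 if f >= threshold else 0 for f in fitted]
    correct = [int(p == actual) for p, actual in zip(predicted, y)]

    def pct(outcome):
        # Share of correct predictions among observations with y == outcome.
        group = [c for c, actual in zip(correct, y) if actual == outcome]
        return 100.0 * sum(group) / len(group)

    overall = 100.0 * sum(correct) / len(y)
    return overall, pct(0), pct(1)


# Hypothetical toy data: four ineligible families and two eligible families.
y = [0, 0, 0, 0, 1, 1]
fitted = [0.10, 0.30, 0.45, 0.60, 0.40, 0.70]
overall, pct0, pct1 = percent_correctly_predicted(y, fitted)
```

In this toy sample the overall percent correctly predicted is about 66.7, but the rule is right for 75% of the y = 0 cases and only 50% of the y = 1 cases, which is the kind of asymmetry part (vii) asks about.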


C10 Use the data in NBASAL for this exercise. 

(i) Estimate a linear regression model relating points per game to experience in the league and 
position (guard, forward, or center). Include experience in quadratic form and use centers as the 
base group. Report the results in the usual form. 

(ii) Why do you not include all three position dummy variables in part (i)? 

(iii) Holding experience fixed, does a guard score more than a center? How much more? Is the 
difference statistically significant? 

(iv) Now, add marital status to the equation. Holding position and experience fixed, are married 
players more productive (based on points per game)? 

(v) Add interactions of marital status with both experience variables. In this expanded model, is 
there strong evidence that marital status affects points per game? 

(vi) Estimate the model from part (iv) but use assists per game as the dependent variable. Are there 
any notable differences from part (iv)? Discuss. 


C11 Use the data in 401KSUBS for this exercise. 

(i) Compute the average, standard deviation, minimum, and maximum values of nettfa in the 
sample. 

(ii) Test the hypothesis that average nettfa does not differ by 401(k) eligibility status; use a two- 
sided alternative. What is the dollar amount of the estimated difference? 

(iii) From part (ii) of Computer Exercise C9, it is clear that e401k is not exogenous in a simple 
regression model; at a minimum, it changes by income and age. Estimate a multiple linear 
regression model for nettfa that includes income, age, and e401k as explanatory variables. The 
income and age variables should appear as quadratics. Now, what is the estimated dollar effect 
of 401(k) eligibility? 


(iv) To the model estimated in part (iii), add the interactions $e401k \cdot (age - 41)$ and 
$e401k \cdot (age - 41)^2$. Note that the average age in the sample is about 41, so that in the new 
model, the coefficient on e401k is the estimated effect of 401(k) eligibility at the average age. 
Which interaction term is significant? 

(v) Comparing the estimates from parts (iii) and (iv), do the estimated effects of 401(k) eligibility at 
age 41 differ much? Explain. 

(vi) Now, drop the interaction terms from the model, but define five family size dummy variables: 
fsize1, fsize2, fsize3, fsize4, and fsize5. The variable fsize5 is unity for families with five or more 
members. Include the family size dummies in the model estimated from part (iii); be sure to 
choose a base group. Are the family dummies significant at the 1% level? 

(vii) Now, do a Chow test for the model 


$$\mathit{nettfa} = \beta_0 + \beta_1 \mathit{inc} + \beta_2 \mathit{inc}^2 + \beta_3 \mathit{age} + \beta_4 \mathit{age}^2 + \beta_5 \mathit{e401k} + u$$


across the five family size categories, allowing for intercept differences. The restricted sum of 
squared residuals, $SSR_r$, is obtained from part (vi) because that regression assumes all slopes are 
the same. The unrestricted sum of squared residuals is $SSR_{ur} = SSR_1 + SSR_2 + \cdots + SSR_5$, 
where $SSR_f$ is the sum of squared residuals for the equation estimated using only family 
size f. You should convince yourself that there are 30 parameters in the unrestricted model 
(5 intercepts plus 25 slopes) and 10 parameters in the restricted model (5 intercepts plus 
5 slopes). Therefore, the number of restrictions being tested is q = 20, and the df for the 
unrestricted model is 9,275 − 30 = 9,245. 


C12 Use the data set in BEAUTY, which contains a subset of the variables (but more usable observations 
than in the regressions) reported by Hamermesh and Biddle (1994). 

(i) Find the separate fractions of men and women that are classified as having above average looks. 
Are more people rated as having above average or below average looks? 

(ii) Test the null hypothesis that the population fractions of above-average-looking women and 
men are the same. Report the one-sided p-value that the fraction is higher for women. (Hint: 
Estimating a simple linear probability model is easiest.) 

(iii) Now estimate the model 


$$\log(\mathit{wage}) = \beta_0 + \beta_1 \mathit{belavg} + \beta_2 \mathit{abvavg} + u$$


separately for men and women, and report the results in the usual form. In both cases, interpret 
the coefficient on belavg. Explain in words what the hypothesis $H_0: \beta_1 = 0$ against $H_1: \beta_1 < 0$ 
means, and find the p-values for men and women. 

(iv) Is there convincing evidence that women with above average looks earn more than women with 
average looks? Explain. 

(v) For both men and women, add the explanatory variables educ, exper, exper², union, goodhlth, 
black, married, south, bigcity, smllcity, and service. Do the effects of the “looks” variables 
change in important ways? 

(vi) Use the SSR form of the Chow F statistic to test whether the slopes of the regression functions 
in part (v) differ across men and women. Be sure to allow for an intercept shift under the null. 


C13 Use the data in APPLE to answer this question. 

(i) Define a binary variable as ecobuy = 1 if ecolbs > 0 and ecobuy = 0 if ecolbs = 0. In other 
words, ecobuy indicates whether, at the prices given, a family would buy any ecologically 
friendly apples. What fraction of families claim they would buy ecolabeled apples? 

(ii) Estimate the linear probability model 


$$\mathit{ecobuy} = \beta_0 + \beta_1 \mathit{ecoprc} + \beta_2 \mathit{regprc} + \beta_3 \mathit{faminc} + \beta_4 \mathit{hhsize} + \beta_5 \mathit{educ} + \beta_6 \mathit{age} + u,$$


and report the results in the usual form. Carefully interpret the coefficients on the price variables. 
(iii) Are the nonprice variables jointly significant in the LPM? (Use the usual F statistic, even 
though it is not valid when there is heteroskedasticity.) Which explanatory variable other than 
the price variables seems to have the most important effect on the decision to buy ecolabeled 
apples? Does this make sense to you? 

(iv) In the model from part (ii), replace faminc with log(faminc). Which model fits the data better, 
using faminc or log(faminc)? Interpret the coefficient on log(faminc). 

(v) In the estimation in part (iv), how many estimated probabilities are negative? How many are 
bigger than one? Should you be concerned? 

(vi) For the estimation in part (iv), compute the percent correctly predicted for each outcome, 
ecobuy = 0 and ecobuy = 1. Which outcome is best predicted by the model? 


C14 Use the data in CHARITY to answer this question. The variable respond is a dummy variable equal to 
one if a person responded with a contribution on the most recent mailing sent by a charitable 
organization. The variable resplast is a dummy variable equal to one if the person responded to the previous 
mailing, avggift is the average of past gifts (in Dutch guilders), and propresp is the proportion of times 
the person has responded to past mailings. 

(i) Estimate a linear probability model relating respond to resplast and avggift. Report the results 
in the usual form, and interpret the coefficient on resplast. 

(ii) Does the average value of past gifts seem to affect the probability of responding? 

(iii) Add the variable propresp to the model, and interpret its coefficient. (Be careful here: an 
increase of one in propresp is the largest possible change.) 

(iv) What happened to the coefficient on resplast when propresp was added to the regression? Does 
this make sense? 

(v) Add mailsyear, the number of mailings per year, to the model. How big is its estimated effect? 
Why might this not be a good estimate of the causal effect of mailings on responding? 


C15 Use the data in FERTIL2 to answer this question. 

(i) Find the smallest and largest values of children in the sample. What is the average of children? 
Does any woman have exactly the average number of children? 

(ii) What percentage of women have electricity in the home? 

(iii) Compute the average of children for those without electricity and do the same for those with 
electricity. Comment on what you find. Test whether the population means are the same using a 
simple regression. 

(iv) From part (iii), can you infer that having electricity “causes” women to have fewer children? 
Explain. 

(v) Estimate a multiple regression model of the kind reported in equation (7.37), but add age², 
urban, and the three religious affiliation dummies. How does the estimated effect of having 
electricity compare with that in part (iii)? Is it still statistically significant? 

(vi) To the equation in part (v), add an interaction between electric and educ. Is its coefficient 
statistically significant? What happens to the coefficient on electric? 

(vii) The median and mode value for educ is 7. In the equation from part (vi), use the centered 
interaction term electric·(educ − 7) in place of electric·educ. What happens to the 
coefficient on electric compared with part (vi)? Why? How does the coefficient on electric compare 
with that in part (v)? 


C16 Use the data in CATHOLIC to answer this question. 

(i) In the entire sample, what percentage of the students attend a Catholic high school? What is the 
average of math12 in the entire sample? 

(ii) Run a simple regression of math12 on cathhs and report the results in the usual way. Interpret 
what you have found. 


(iii) Now add the variables lfaminc, motheduc, and fatheduc to the regression from part (ii). How 
many observations are used in the regression? What happens to the coefficient on cathhs, along 
with its statistical significance? 

(iv) Return to the simple regression of math12 on cathhs, but restrict the regression to observations 
used in the multiple regression from part (iii). Do any important conclusions change? 

(v) To the multiple regression in part (iii), add interactions between cathhs and each of the other 
explanatory variables. Are the interaction terms individually or jointly significant? 

(vi) What happens to the coefficient on cathhs in the regression from part (v)? Explain why this 
coefficient is not very interesting. 

(vii) Compute the average partial effect of cathhs in the model estimated in part (v). How does it 
compare with the coefficients on cathhs in parts (iii) and (v)? 


C17 Use the data in JTRAIN98 to answer this question. The variable unem98 is a binary variable indicating 
whether a worker was unemployed in 1998. It can be used to measure the effectiveness of the job 
training program in reducing the probability of being unemployed. 

(i) What percentage of workers was unemployed in 1998, after the job training program? How 
does this compare with the unemployment rate in 1996? 

(ii) Run the simple regression unem98 on train. How do you interpret the coefficient on train? Is it 
statistically significant? Does it make sense to you? 

(iii) Add to the regression in part (ii) the explanatory variables earn96, educ, age, and married. 
Now interpret the estimated training effect. Why does it differ so much from that in part (ii)? 

(iv) Now perform full regression adjustment by running a regression with a full set of interactions, 
where all variables (except the training indicator) are centered around their sample means: 


$unem98_i$ on $train_i$, $earn96_i$, $educ_i$, $age_i$, $married_i$, $train_i \cdot (earn96_i - \overline{earn96})$, 
$train_i \cdot (educ_i - \overline{educ})$, $train_i \cdot (age_i - \overline{age})$, $train_i \cdot (married_i - \overline{married})$. 


This regression uses all of the data. What happens to the estimated average treatment effect of train 
compared with part (iii)? Does its standard error change much? 

(v) Are the interaction terms in part (iv) jointly significant? 

(vi) Verify that you obtain exactly the same average treatment effect if you run two separate 
regressions and use the formula in equation (7.43). That is, run two separate regressions for the 
control and treated groups, obtain the fitted values $\widehat{unem98}_i^{(0)}$ and $\widehat{unem98}_i^{(1)}$ for everyone in the 
sample, and then compute 


$$\hat{\tau}_{ura} = n^{-1} \sum_{i=1}^{n} \left[ \widehat{unem98}_i^{(1)} - \widehat{unem98}_i^{(0)} \right].$$


Check this with the coefficient on train in part (iv). Which approach is more convenient for obtaining 
a standard error? 


CHAPTER 8 


Heteroskedasticity 


The homoskedasticity assumption, introduced in Chapter 3 for multiple regression, states that 
the variance of the unobserved error, u, conditional on the explanatory variables, is constant. 
Homoskedasticity fails whenever the variance of the unobserved factors changes across different 
segments of the population, where the segments are determined by the different values of the 
explanatory variables. For example, in a savings equation, heteroskedasticity is present if the variance 
of the unobserved factors affecting savings increases with income. 

In Chapters 4 and 5, we saw that homoskedasticity is needed to justify the usual t tests, F tests, 
and confidence intervals for OLS estimation of the linear regression model, even with large sample 
sizes. In this chapter, we discuss the available remedies when heteroskedasticity occurs, and we also 
show how to test for its presence. We begin by briefly reviewing the consequences of heteroskedasticity 
for ordinary least squares estimation. 


8-1 Consequences of Heteroskedasticity for OLS 
Consider again the multiple linear regression model: 


$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + u. \qquad [8.1]$$


In Chapter 3, we proved unbiasedness of the OLS estimators $\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2, \ldots, \hat{\beta}_k$ under the first four 
Gauss-Markov assumptions, MLR.1 through MLR.4. In Chapter 5, we showed that the same four 
assumptions imply consistency of OLS. The homoskedasticity assumption MLR.5, stated in terms of the 
error variance as $\mathrm{Var}(u|x_1, x_2, \ldots, x_k) = \sigma^2$, played no role in showing whether OLS was unbiased or 
consistent. It is important to remember that heteroskedasticity does not cause bias or inconsistency in the 
OLS estimators of the $\beta_j$, whereas something like omitting an important variable would have this effect. 

The interpretation of our goodness-of-fit measures, $R^2$ and $\bar{R}^2$, is also unaffected by the 
presence of heteroskedasticity. Why? Recall from Section 6-3 that the usual R-squared and the adjusted 
R-squared are different ways of estimating the population R-squared, which is simply $1 - \sigma_u^2/\sigma_y^2$, 
where $\sigma_u^2$ is the population error variance and $\sigma_y^2$ is the population variance of y. The key point is 
that because both variances in the population R-squared are unconditional variances, the population 
R-squared is unaffected by the presence of heteroskedasticity in $\mathrm{Var}(u|x_1, \ldots, x_k)$. Further, SSR/n 
consistently estimates $\sigma_u^2$, and SST/n consistently estimates $\sigma_y^2$, whether or not $\mathrm{Var}(u|x_1, \ldots, x_k)$ 
is constant. The same is true when we use the degrees of freedom adjustments. Therefore, $R^2$ and 
$\bar{R}^2$ are both consistent estimators of the population R-squared whether or not the homoskedasticity 
assumption holds. 

If heteroskedasticity does not cause bias or inconsistency in the OLS estimators, why did we 
introduce it as one of the Gauss-Markov assumptions? Recall from Chapter 3 that the estimators 
of the variances, $\mathrm{Var}(\hat{\beta}_j)$, are biased without the homoskedasticity assumption. Because the OLS 
standard errors are based directly on these variances, they are no longer valid for constructing 
confidence intervals and t statistics. The usual OLS t statistics do not have t distributions in the presence 
of heteroskedasticity, and the problem is not resolved by using large sample sizes. We will see this 
explicitly for the simple regression case in the next section, where we derive the variance of the 
OLS slope estimator under heteroskedasticity and propose a valid estimator in the presence of 
heteroskedasticity. Similarly, F statistics are no longer F distributed, and the LM statistic no longer has 
an asymptotic chi-square distribution. In summary, the statistics we used to test hypotheses under the 
Gauss-Markov assumptions are not valid in the presence of heteroskedasticity. 

We also know that the Gauss-Markov Theorem, which says that OLS is best linear unbiased, 
relies crucially on the homoskedasticity assumption. If Var(u|x) is not constant, OLS is no longer 
BLUE. In addition, OLS is no longer asymptotically efficient in the class of estimators described in 
Theorem 5.3. As we will see in Section 8-4, it is possible to find estimators that are more efficient 
than OLS in the presence of heteroskedasticity (although it requires knowing the form of the hetero- 
skedasticity). With relatively large sample sizes, it might not be so important to obtain an efficient 
estimator. In the next section, we show how the usual OLS test statistics can be modified so that they 
are valid, at least asymptotically. 


8-2 Heteroskedasticity-Robust Inference after OLS Estimation 


Because testing hypotheses is such an important component of any econometric analysis and the usual 
OLS inference is generally faulty in the presence of heteroskedasticity, we must decide if we should 
entirely abandon OLS. Fortunately, OLS is still useful. In the last two decades, econometricians have 
learned how to adjust standard errors and t, F, and LM statistics so that they are valid in the presence 
of heteroskedasticity of unknown form. This is very convenient because it means we can report new 
statistics that work regardless of the kind of heteroskedasticity present in the population. The methods in 
this section are known as heteroskedasticity-robust procedures because they are valid—at least in large 
samples—whether or not the errors have constant variance, and we do not need to know which is the case. 

We begin by sketching how the variances, $\mathrm{Var}(\hat{\beta}_j)$, can be estimated in the presence of 
heteroskedasticity. A careful derivation of the theory is well beyond the scope of this text, but the application of 
heteroskedasticity-robust methods is very easy now because many statistics and econometrics 
packages compute these statistics as an option. 

First, consider the model with a single independent variable, where we include an i subscript for 
emphasis: 


$$y_i = \beta_0 + \beta_1 x_i + u_i.$$

We assume throughout that the first four Gauss-Markov assumptions hold. If the errors contain 
heteroskedasticity, then 

$$\mathrm{Var}(u_i|x_i) = \sigma_i^2,$$

where we put an i subscript on $\sigma^2$ to indicate that the variance of the error depends upon the particular 
value of $x_i$. 
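
Write the OLS estimator as 

$$\hat{\beta}_1 = \beta_1 + \frac{\sum_{i=1}^{n} (x_i - \bar{x}) u_i}{\sum_{i=1}^{n} (x_i - \bar{x})^2}.$$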

Under Assumptions MLR.1 through MLR.4 (that is, without the homoskedasticity assumption), and 
conditioning on the values $x_i$ in the sample, we can use the same arguments from Chapter 2 to show 
that 


$$\mathrm{Var}(\hat{\beta}_1) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sigma_i^2}{SST_x^2}, \qquad [8.2]$$


where $SST_x = \sum_{i=1}^{n} (x_i - \bar{x})^2$ is the total sum of squares of the $x_i$. When $\sigma_i^2 = \sigma^2$ for all i, this formula 
reduces to the usual form, $\sigma^2/SST_x$. Equation (8.2) explicitly shows that, for the simple regression 
case, the variance formula derived under homoskedasticity is no longer valid when heteroskedasticity 
is present. 
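
To see how (8.2) behaves numerically, the following minimal Python sketch evaluates the formula for a given set of regressor values and error variances. The function name `var_beta1` and the numbers are hypothetical and serve only as an illustration; in practice the $\sigma_i^2$ are unknown.

```python
def var_beta1(x, sigma2_i):
    """Equation (8.2): sum((x_i - xbar)^2 * sigma_i^2) / SST_x^2."""
    n = len(x)
    xbar = sum(x) / n
    sst_x = sum((xi - xbar) ** 2 for xi in x)
    numerator = sum((xi - xbar) ** 2 * s2 for xi, s2 in zip(x, sigma2_i))
    return numerator / sst_x ** 2


x = [1.0, 2.0, 3.0, 4.0, 5.0]             # hypothetical regressor values, SST_x = 10

# Homoskedasticity: sigma_i^2 = 2 for all i, so (8.2) equals sigma^2 / SST_x = .2.
var_homo = var_beta1(x, [2.0] * len(x))

# Heteroskedasticity: hypothetical error variances that grow with x (sigma_i^2 = x_i).
var_hetero = var_beta1(x, x)
```

With constant variances the result coincides with $\sigma^2/SST_x$. With variances that rise in x, the observations far from $\bar{x}$ receive more weight, and the value (.3 here) no longer matches any single $\sigma^2/SST_x$ computed from an average variance in an obvious way.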

Because the standard error of $\hat{\beta}_1$ is based directly on estimating $\mathrm{Var}(\hat{\beta}_1)$, we need a way to 
estimate equation (8.2) when heteroskedasticity is present. White (1980) showed how this can be 
done. Let $\hat{u}_i$ denote the OLS residuals from the initial regression of y on x. Then, a valid estimator of 
$\mathrm{Var}(\hat{\beta}_1)$, for heteroskedasticity of any form (including homoskedasticity), is 


$$\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2 \hat{u}_i^2}{SST_x^2}, \qquad [8.3]$$


which is easily computed from the data after the OLS regression. 

In what sense is (8.3) a valid estimator of $\mathrm{Var}(\hat{\beta}_1)$? This is pretty subtle. Briefly, it can be 
shown that when equation (8.3) is multiplied by the sample size n, it converges in probability to 
$E[(x_i - \mu_x)^2 u_i^2]/(\sigma_x^2)^2$, which is the probability limit of n times (8.2). Ultimately, this is what is 
necessary for justifying the use of standard errors to construct confidence intervals and t statistics. The law 
of large numbers and the central limit theorem play key roles in establishing these convergences. You 
can refer to White’s original paper for details, but that paper is quite technical. See also Wooldridge 
(2010, Chapter 4). 
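
As a rough illustration of how (8.3) is computed from the data, here is a minimal Python sketch for the simple regression case. The function name and the three data points are hypothetical; the usual standard error is included only for comparison.

```python
import math


def simple_ols_with_robust_se(x, y):
    """Return (b1, usual se, robust se) for the regression of y on x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    dev = [xi - xbar for xi in x]
    sst_x = sum(d * d for d in dev)

    b1 = sum(d * yi for d, yi in zip(dev, y)) / sst_x
    b0 = ybar - b1 * xbar
    resid = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]

    # Usual variance estimator: sigma_hat^2 / SST_x, with sigma_hat^2 = SSR / (n - 2).
    usual_var = (sum(u * u for u in resid) / (n - 2)) / sst_x

    # Equation (8.3): sum((x_i - xbar)^2 * u_hat_i^2) / SST_x^2.
    robust_var = sum(d * d * u * u for d, u in zip(dev, resid)) / sst_x ** 2

    return b1, math.sqrt(usual_var), math.sqrt(robust_var)


# Hypothetical data, small enough to verify by hand.
b1, se_usual, se_robust = simple_ols_with_robust_se([0.0, 1.0, 2.0], [0.0, 1.0, 4.0])
```

For these three points the slope is 2, the usual variance estimate is 1/3, and (8.3) gives 1/18. With n = 3 neither number is reliable, of course; the point is only to show which quantities enter each formula.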

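A similar formula works in the general multiple regression model 


$$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + u.$$


It can be shown that a valid estimator of $\mathrm{Var}(\hat{\beta}_j)$, under Assumptions MLR.1 through MLR.4, is 


$$\widehat{\mathrm{Var}}(\hat{\beta}_j) = \frac{\sum_{i=1}^{n} \hat{r}_{ij}^2 \hat{u}_i^2}{SSR_j^2}, \qquad [8.4]$$
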
where $\hat{r}_{ij}$ denotes the ith residual from regressing $x_j$ on all other independent variables, and $SSR_j$ is the 
sum of squared residuals from this regression (see Section 3-2 for the partialling out representation of the 
OLS estimates). The square root of the quantity in (8.4) is called the heteroskedasticity-robust standard 
error for $\hat{\beta}_j$. In econometrics, these robust standard errors are usually attributed to White (1980). Earlier 
works in statistics, notably those by Eicker (1967) and Huber (1967), pointed to the possibility of 
obtaining such robust standard errors. In applied work, these are sometimes called White, Huber, or Eicker 
standard errors (or some hyphenated combination of these names). We will just refer to them as 
heteroskedasticity-robust standard errors, or even just robust standard errors when the context is clear. 

Sometimes, as a degrees of freedom correction, (8.4) is multiplied by n/(n − k − 1) before taking 
the square root. The reasoning for this adjustment is that, if the squared OLS residuals $\hat{u}_i^2$ were the same for 
all observations i (the strongest possible form of homoskedasticity in a sample), we would get the usual 
OLS standard errors. Other modifications of (8.4) are studied in MacKinnon and White (1985). Because 
all forms have only asymptotic justification and they are asymptotically equivalent, no form is uniformly 
preferred above all others. Typically, we use whatever form is computed by the regression package at hand. 
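
The partialling-out form of (8.4), along with the optional degrees of freedom correction, can be sketched as follows. To keep the example free of matrix algebra we assume exactly two regressors, so each partialling-out step is itself a simple regression; the function names and the data are hypothetical.

```python
import math


def partial_out(z, w):
    """Residuals from regressing w on z (with an intercept)."""
    n = len(z)
    zbar, wbar = sum(z) / n, sum(w) / n
    slope = (sum((zi - zbar) * (wi - wbar) for zi, wi in zip(z, w))
             / sum((zi - zbar) ** 2 for zi in z))
    return [(wi - wbar) - slope * (zi - zbar) for zi, wi in zip(z, w)]


def robust_se_two_regressors(x1, x2, y, df_correction=False):
    """Return [(b1, se1), (b2, se2)], with each se_j based on equation (8.4)."""
    n, k = len(y), 2
    r = [partial_out(x2, x1), partial_out(x1, x2)]    # r_hat_i1 and r_hat_i2
    ssr = [sum(v * v for v in rj) for rj in r]        # SSR_1 and SSR_2
    b = [sum(v * yi for v, yi in zip(rj, y)) / s for rj, s in zip(r, ssr)]
    b0 = sum(y) / n - b[0] * sum(x1) / n - b[1] * sum(x2) / n
    u = [yi - b0 - b[0] * a - b[1] * c for yi, a, c in zip(y, x1, x2)]

    scale = n / (n - k - 1) if df_correction else 1.0
    out = []
    for bj, rj, s in zip(b, r, ssr):
        var_j = scale * sum((v * ui) ** 2 for v, ui in zip(rj, u)) / s ** 2
        out.append((bj, math.sqrt(var_j)))
    return out


# Hypothetical data chosen so that every squared OLS residual equals .25.
x1 = [-1.0, 1.0, -1.0, 1.0]
x2 = [-1.0, -1.0, 1.0, 1.0]
y = [0.0, 2.0, 1.0, 5.0]

(b1, se1), (b2, se2) = robust_se_two_regressors(x1, x2, y)
(_, se1_adj), (_, se2_adj) = robust_se_two_regressors(x1, x2, y, df_correction=True)
```

Because the squared residuals are identical across observations in this constructed sample, the corrected robust standard error (.5) equals the usual OLS standard error, $\sqrt{\hat{\sigma}^2/SSR_1} = \sqrt{1/4}$, while the uncorrected version (.25) does not. This is the motivation for the n/(n − k − 1) adjustment described above.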

Once heteroskedasticity-robust standard errors are obtained, it is simple to construct a 
heteroskedasticity-robust t statistic. Recall that the general form of the t statistic is 


$$t = \frac{\text{estimate} - \text{hypothesized value}}{\text{standard error}}. \qquad [8.5]$$


Because we are still using the OLS estimates and we have chosen the hypothesized value ahead of 
time, the only difference between the usual OLS t statistic and the heteroskedasticity-robust t statistic 
is in how the standard error in the denominator is computed. 

The term $SSR_j$ in equation (8.4) can be replaced with $SST_j(1 - R_j^2)$, where $SST_j$ is the total 
sum of squares of $x_j$ and $R_j^2$ is the usual R-squared from regressing $x_j$ on all other explanatory 
variables. [We implicitly used this equivalence in deriving equation (3.51).] Consequently, little sample 
variation in $x_j$, or a strong linear relationship between $x_j$ and the other explanatory variables (that is, 
multicollinearity) can cause the heteroskedasticity-robust standard errors to be large. We discussed 
these issues with the usual OLS standard errors in Section 3-4. 


EXAMPLE 8.1 Log Wage Equation with Heteroskedasticity-Robust Standard Errors 


We estimate the model in Example 7.6, but we report the heteroskedasticity-robust standard errors 
along with the usual OLS standard errors. Some of the estimates are reported to more digits so that we 
can compare the usual standard errors with the heteroskedasticity-robust standard errors: 


$$
\begin{array}{rllll}
\widehat{\log(\mathit{wage})} = & .321 & +\ .213\ \mathit{marrmale} & -\ .198\ \mathit{marrfem} & -\ .110\ \mathit{singfem} \\
 & (.100) & \quad (.055) & \quad (.058) & \quad (.056) \\
 & [.109] & \quad [.057] & \quad [.058] & \quad [.057] \\
 & & +\ .0789\ \mathit{educ} & +\ .0268\ \mathit{exper} & -\ .00054\ \mathit{exper}^2 \\
 & & \quad (.0067) & \quad (.0052) & \quad (.00011) \\
 & & \quad [.0074] & \quad [.0051] & \quad [.00011] \\
 & & +\ .0291\ \mathit{tenure} & -\ .00053\ \mathit{tenure}^2 & \\
 & & \quad (.0068) & \quad (.00023) & \\
 & & \quad [.0069] & \quad [.00024] &
\end{array}
\qquad [8.6]
$$

$$n = 526,\ R^2 = .461.$$


The usual OLS standard errors are in parentheses, ( ), below the corresponding OLS estimate, and 
the heteroskedasticity-robust standard errors are in brackets, [ ]. The numbers in brackets are the only 
new things, as the equation is still estimated by OLS. 

Several things are apparent from equation (8.6). First, in this particular application, any variable 
that was statistically significant using the usual t statistic is still statistically significant using the 
heteroskedasticity-robust t statistic. This occurs because the two sets of standard errors are not very 
different. (The associated p-values will differ slightly because the robust t statistics are not identical 
to the usual, nonrobust t statistics.) The largest relative change in standard errors is for the coefficient 
on educ: the usual standard error is .0067, and the robust standard error is .0074. Still, the robust 
standard error implies a robust t statistic above 10. 
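
The claim about the educ coefficient is easy to check with the general form of the t statistic in (8.5). The short Python sketch below uses the estimate and the two standard errors reported in (8.6); the helper name `t_stat` is our own.

```python
def t_stat(estimate, standard_error, hypothesized=0.0):
    """General form (8.5): (estimate - hypothesized value) / standard error."""
    return (estimate - hypothesized) / standard_error


# Coefficient on educ in equation (8.6), tested against H0: beta_educ = 0.
t_usual = t_stat(0.0789, 0.0067)    # uses the usual OLS standard error
t_robust = t_stat(0.0789, 0.0074)   # uses the heteroskedasticity-robust standard error
```

The usual t statistic is roughly 11.8 and the robust one roughly 10.7, so the larger robust standard error lowers the t statistic without changing the conclusion.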

Equation (8.6) also shows that the robust standard errors can be either larger or smaller than the 
usual standard errors. For example, the robust standard error on exper is .0051, whereas the usual 
standard error is .0052. We do not know which will be larger ahead of time. As an empirical matter, 
the robust standard errors are often found to be larger than the usual standard errors. 

Before leaving this example, we must emphasize that we do not know, at this point, whether 
heteroskedasticity is even present in the population model underlying equation (8.6). All we have 
done is report, along with the usual standard errors, those that are valid (asymptotically) whether or 
not heteroskedasticity is present. We can see that no important conclusions are overturned by using 
the robust standard errors in this example. This often happens in applied work, but in other cases, the 
differences between the usual and robust standard errors are much larger. As an example of where the 
differences are substantial, see Computer Exercise C2. 


At this point, you may be asking the following question: if the heteroskedasticity-robust standard 
errors are valid more often than the usual OLS standard errors, why do we bother with the usual 
standard errors at all? This is a sensible question. One reason the usual standard errors are still used in 
cross-sectional work is that, if the homoskedasticity assumption holds and the errors are normally distributed, 
then the usual t statistics have exact t distributions, regardless of the sample size (see Chapter 4). 
The robust standard errors and robust t statistics are justified only as the sample size becomes large, 
even if the CLM assumptions are true. With small sample sizes, the robust t statistics can have 
distributions that are not very close to the t distribution, and that could throw off our inference. 

In large sample sizes, we can make a case for always reporting only the heteroskedasticity-robust 
standard errors in cross-sectional applications, and this practice is being followed more and more in 
applied work. It is also common to report both standard errors, as in equation (8.6), so that a reader 
can determine whether any conclusions are sensitive to the standard error in use. 

It is also possible to obtain F and LM statistics that are robust to heteroskedasticity of an unknown, 
arbitrary form. The heteroskedasticity-robust F statistic (or a simple transformation of it) is also called 
a heteroskedasticity-robust Wald statistic. A general treatment of the Wald statistic requires matrix alge- 
bra and is sketched in Advanced Treatment E; see Wooldridge (2010, Chapter 4) for a more detailed 
treatment. Nevertheless, using heteroskedasticity-robust statistics for multiple exclusion restrictions is 
straightforward because many econometrics packages now compute such statistics routinely. 


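EXAMPLE 8.2 Heteroskedasticity-Robust F Statistic 

Using the data for the spring semester in GPA3, we estimate the following equation: 


$$
\begin{array}{rllll}
\widehat{\mathit{cumgpa}} = & 1.47 & +\ .00114\ \mathit{sat} & -\ .00857\ \mathit{hsperc} & +\ .00250\ \mathit{tothrs} \\
 & (.23) & \quad (.00018) & \quad (.00124) & \quad (.00073) \\
 & [.22] & \quad [.00019] & \quad [.00140] & \quad [.00073] \\
 & & +\ .303\ \mathit{female} & -\ .128\ \mathit{black} & -\ .059\ \mathit{white} \\
 & & \quad (.059) & \quad (.147) & \quad (.141) \\
 & & \quad [.059] & \quad [.118] & \quad [.110]
\end{array}
\qquad [8.7]
$$

$$n = 366,\ R^2 = .4006,\ \bar{R}^2 = .3905.$$
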
Again, the differences between the usual standard errors and the heteroskedasticity-robust standard
errors are not very big, and use of the robust t statistics does not change the statistical significance
of any independent variable. Joint significance tests are not much affected either. Suppose we wish
to test the null hypothesis that, after the other factors are controlled for, there are no differences in
cumgpa by race. This is stated as $H_0\colon \beta_{black} = 0,\ \beta_{white} = 0$. The usual F statistic is easily obtained,
once we have the R-squared from the restricted model; this turns out to be .3983. The F statistic is
then $[(.4006 - .3983)/(1 - .4006)](359/2) \approx .69$. If heteroskedasticity is present, this version of
the test is invalid. The heteroskedasticity-robust version has no simple form, but it can be computed 
using certain statistical packages. The value of the heteroskedasticity-robust F statistic turns out to 
be .75, which differs only slightly from the nonrobust version. The p-value for the robust test is .474, 
which is not close to standard significance levels. We fail to reject the null hypothesis using either test. 
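The arithmetic behind the usual (nonrobust) F statistic in this example can be checked in a few lines. This is only a sketch using the numbers reported above; the function name `r2_form_f` is hypothetical, and the robust version is not reproduced because it has no simple form.

```python
def r2_form_f(r2_ur, r2_r, q, df_ur):
    """R-squared form of the F statistic for q exclusion restrictions."""
    return ((r2_ur - r2_r) / (1.0 - r2_ur)) * (df_ur / q)

# Equation (8.7) has n = 366 and 7 parameters (intercept plus six slopes),
# so the unrestricted df is 366 - 7 = 359; two restrictions are tested.
F = r2_form_f(r2_ur=0.4006, r2_r=0.3983, q=2, df_ur=366 - 7)
print(round(F, 2))  # 0.69
```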


Because the usual sum of squared residuals form of the F statistic is not valid under heteroskedas- 
ticity, we must be careful in computing a Chow test of common coefficients across two groups. The 
form of the statistic in equation (7.24) is not valid if heteroskedasticity is present, including the simple 
case where the error variance differs across the two groups. Instead, we can obtain a heteroskedasticity- 
robust Chow test by including a dummy variable distinguishing the two groups along with interactions 
between that dummy variable and all other explanatory variables. We can then test whether there is no 
difference in the two regression functions—by testing that the coefficients on the dummy variable and 
all interactions are zero—or just test whether the slopes are all the same, in which case we leave the 
coefficient on the dummy variable unrestricted. See Computer Exercise C14 for an example. 


8-2a Computing Heteroskedasticity-Robust LM Tests 


Not all regression packages compute F statistics that are robust to heteroskedasticity. Therefore, it is
sometimes convenient to have a way of obtaining a test of multiple exclusion restrictions that is robust
to heteroskedasticity and does not require a particular kind of econometric software. It turns out that a
heteroskedasticity-robust LM statistic is easily obtained using virtually any regression package.

GOING FURTHER 8.1
Evaluate the following statement: The heteroskedasticity-robust standard errors are always bigger
than the usual standard errors.

To illustrate computation of the robust LM statistic, consider the model 


$$
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5 + u,
$$

and suppose we would like to test $H_0\colon \beta_4 = 0,\ \beta_5 = 0$. To obtain the usual LM statistic, we would
first estimate the restricted model (that is, the model without $x_4$ and $x_5$) to obtain the residuals, $\tilde{u}$.
Then, we would regress $\tilde{u}$ on all of the independent variables and compute $LM = n \cdot R_{\tilde{u}}^2$, where $R_{\tilde{u}}^2$ is the
usual R-squared from this regression.

Obtaining a version that is robust to heteroskedasticity requires more work. One way to compute
the statistic requires only OLS regressions. We need the residuals, say, $\tilde{r}_1$, from the regression of
$x_4$ on $x_1, x_2, x_3$. Also, we need the residuals, say, $\tilde{r}_2$, from the regression of $x_5$ on $x_1, x_2, x_3$. Thus, we
regress each of the independent variables excluded under the null on all of the included independent
variables. We keep the residuals each time. The final step appears odd, but it is, after all, just a
computational device. Run the regression of

$$
1 \ \text{on} \ \tilde{r}_1\tilde{u},\ \tilde{r}_2\tilde{u} \tag{8.8}
$$

without an intercept. Yes, we actually define a dependent variable equal to the value one for all
observations. We regress this onto the products $\tilde{r}_1\tilde{u}$ and $\tilde{r}_2\tilde{u}$. The robust LM statistic turns out to be
$n - SSR_1$, where $SSR_1$ is just the usual sum of squared residuals from regression (8.8).
The reason this works is somewhat technical. Basically, this is doing for the LM test what the
robust standard errors do for the t test. [See Wooldridge (1991b) or Davidson and MacKinnon (1993)
for a more detailed discussion.]

We now summarize the computation of the heteroskedasticity-robust LM statistic in the general case. 


A Heteroskedasticity-Robust LM Statistic: 


1. Obtain the residuals $\tilde{u}$ from the restricted model.

2. Regress each of the independent variables excluded under the null on all of the included
independent variables; if there are q excluded variables, this leads to q sets of residuals $(\tilde{r}_1, \tilde{r}_2, \ldots, \tilde{r}_q)$.

3. Find the products of each $\tilde{r}_j$ and $\tilde{u}$ (for all observations).

4. Run the regression of 1 on $\tilde{r}_1\tilde{u}, \tilde{r}_2\tilde{u}, \ldots, \tilde{r}_q\tilde{u}$, without an intercept. The heteroskedasticity-robust
LM statistic is $n - SSR_1$, where $SSR_1$ is just the usual sum of squared residuals from this final
regression. Under $H_0$, LM is distributed approximately as $\chi_q^2$.


Once the robust LM statistic is obtained, the rejection rule and computation of p-values are the same 
as for the usual LM statistic in Section 5-2. 
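The four steps can be sketched in plain Python for the five-regressor model above, testing that the coefficients on $x_4$ and $x_5$ are zero. Everything here is an illustrative assumption rather than part of the text: the data are simulated, the helper `ols` is a hypothetical teaching routine that solves the normal equations, and the p-value uses the closed form exp(-LM/2), which holds only for q = 2 degrees of freedom.

```python
import math
import random

def ols(y, X):
    """OLS of y on the columns of X (a list of rows); returns (coefficients, residuals)."""
    k = len(X[0])
    # Augmented normal equations [X'X | X'y], solved by Gauss-Jordan elimination.
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))] for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))  # partial pivoting
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    b = [A[i][k] / A[i][i] for i in range(k)]
    resid = [yi - sum(bj * xj for bj, xj in zip(b, row)) for row, yi in zip(X, y)]
    return b, resid

random.seed(0)
n = 500
x = [[random.gauss(0, 1) for _ in range(5)] for _ in range(n)]  # x1, ..., x5
# Simulated data (assumption): H0 is true and the error s.d. depends on x1.
y = [1 + r[0] + 0.5 * r[1] - r[2] + random.gauss(0, 1) * math.exp(0.5 * r[0]) for r in x]

included = [[1.0, r[0], r[1], r[2]] for r in x]            # intercept, x1, x2, x3
_, u = ols(y, included)                                     # step 1: restricted residuals
_, r1 = ols([r[3] for r in x], included)                    # step 2: x4 on included variables
_, r2 = ols([r[4] for r in x], included)                    #         x5 on included variables
prods = [[a * ui, b * ui] for a, b, ui in zip(r1, r2, u)]   # step 3: products
_, e = ols([1.0] * n, prods)                                # step 4: 1 on products, no intercept
ssr1 = sum(ei ** 2 for ei in e)
lm = n - ssr1
p_value = math.exp(-lm / 2.0)   # chi-square upper tail, valid for 2 df only
print(round(lm, 3), round(p_value, 3))
```

Because the final regression has no intercept, its sum of squared residuals can never exceed n, so the statistic is nonnegative by construction.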


EXAMPLE 8.3 Heteroskedasticity-Robust LM Statistic 


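We use the data in CRIME1 to test whether the average sentence length served for past convictions
affects the number of arrests in the current year (1986). The estimated model is

$$
\begin{aligned}
\widehat{\mathit{narr86}} = {} & \underset{\substack{(.036)\\ [.040]}}{.561} - \underset{\substack{(.040)\\ [.034]}}{.136}\,\mathit{pcnv} + \underset{\substack{(.0097)\\ [.0101]}}{.0178}\,\mathit{avgsen} - \underset{\substack{(.00030)\\ [.00021]}}{.00052}\,\mathit{avgsen}^2 \\
& - \underset{\substack{(.0087)\\ [.0062]}}{.0394}\,\mathit{ptime86} - \underset{\substack{(.0144)\\ [.0142]}}{.0505}\,\mathit{qemp86} - \underset{\substack{(.00034)\\ [.00023]}}{.00148}\,\mathit{inc86} \\
& + \underset{\substack{(.045)\\ [.058]}}{.325}\,\mathit{black} + \underset{\substack{(.040)\\ [.040]}}{.193}\,\mathit{hispan}
\end{aligned}
\tag{8.9}
$$

$n = 2{,}725$, $R^2 = .0728$.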


In this example, there are more substantial differences between some of the usual standard errors
and the robust standard errors. For example, the usual t statistic on $avgsen^2$ is about $-1.73$, while the
robust t statistic is about $-2.48$. Thus, $avgsen^2$ is more significant using the robust standard error.

The effect of avgsen on narr86 is somewhat difficult to reconcile. Because the relationship
is quadratic, we can figure out where avgsen has a positive effect on narr86 and where the effect
becomes negative. The turning point is $.0178/[2(.00052)] \approx 17.12$; recall that this is measured in
months. Literally, this means that narr86 is positively related to avgsen when avgsen is less than 17
months; then avgsen has the expected deterrent effect after 17 months.
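As a worked step, the turning point comes from setting the partial effect of avgsen in the estimated quadratic to zero, holding the other variables fixed. The symbol $avgsen^{*}$ for the turning point is notation introduced here for illustration.

```latex
\[
\frac{\partial\,\widehat{\mathit{narr86}}}{\partial\,\mathit{avgsen}}
  = .0178 - 2(.00052)\,\mathit{avgsen} = 0
\quad\Longrightarrow\quad
\mathit{avgsen}^{*} = \frac{.0178}{2(.00052)} \approx 17.12 .
\]
```

The partial effect is positive for values of avgsen below this point and negative above it.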

To see whether average sentence length has a statistically significant effect on narr86, we must
test the joint hypothesis $H_0\colon \beta_{avgsen} = 0,\ \beta_{avgsen^2} = 0$. Using the usual LM statistic (see Section 5-2), we
obtain LM = 3.54; in a chi-square distribution with two df, this yields a p-value = .170. Thus, we do
not reject $H_0$ at even the 15% level. The heteroskedasticity-robust LM statistic is LM = 4.00 (rounded
to two decimal places), with a p-value = .135. This is still not very strong evidence against $H_0$; avgsen
does not appear to have a strong effect on narr86. [Incidentally, when avgsen appears alone in (8.9),
that is, without the quadratic term, its usual t statistic is .658, and its robust t statistic is .592.]
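The two p-values quoted in this example can be reproduced without statistical software, because the upper-tail probability of a chi-square distribution with two degrees of freedom has the closed form exp(-x/2). The function name `chi2_sf_2df` is hypothetical and the shortcut applies only to two df.

```python
import math

def chi2_sf_2df(x):
    """P(X > x) for a chi-square random variable with 2 df."""
    return math.exp(-x / 2.0)

p_usual = chi2_sf_2df(3.54)    # usual LM statistic
p_robust = chi2_sf_2df(4.00)   # heteroskedasticity-robust LM statistic
print(round(p_usual, 3), round(p_robust, 3))  # 0.17 0.135
```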
8-3 Testing for Heteroskedasticity


The heteroskedasticity-robust standard errors provide a simple method for computing t statistics that
are asymptotically t distributed whether or not heteroskedasticity is present. We have also seen that
heteroskedasticity-robust F and LM statistics are available. Implementing these tests does not require
knowing whether or not heteroskedasticity is present. Nevertheless, there are still some good reasons
for having simple tests that can detect its presence. First, as we mentioned in the previous section,
the usual t statistics have exact t distributions under the classical linear model assumptions. For this
reason, many economists still prefer to see the usual OLS standard errors and test statistics reported,
unless there is evidence of heteroskedasticity. Second, if heteroskedasticity is present, the OLS
estimator is no longer the best linear unbiased estimator. As we will see in Section 8-4, it is possible to
obtain a better estimator than OLS when the form of heteroskedasticity is known.

Many tests for heteroskedasticity have been suggested over the years. Some of them, while hav- 
ing the ability to detect heteroskedasticity, do not directly test the assumption that the variance of 
the error does not depend upon the independent variables. We will restrict ourselves to more modern 
tests, which detect the kind of heteroskedasticity that invalidates the usual OLS statistics. This also 
has the benefit of putting all tests in the same framework. 

As usual, we start with the linear model 


$$
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + u, \tag{8.10}
$$
where Assumptions MLR.1 through MLR.4 are maintained in this section. In particular, we assume
that $E(u \mid x_1, x_2, \ldots, x_k) = 0$, so that OLS is unbiased and consistent.

We take the null hypothesis to be that Assumption MLR.5 is true:
$$
H_0\colon \operatorname{Var}(u \mid x_1, x_2, \ldots, x_k) = \sigma^2. \tag{8.11}
$$


That is, we assume that the ideal assumption of homoskedasticity holds, and we require the data 
to tell us otherwise. If we cannot reject (8.11) at a sufficiently small significance level, we usually 
conclude that heteroskedasticity is not a problem. However, remember that we never accept $H_0$; we
simply fail to reject it. 

Because we are assuming that u has a zero conditional expectation, $\operatorname{Var}(u \mid \mathbf{x}) = E(u^2 \mid \mathbf{x})$, and so
the null hypothesis of homoskedasticity is equivalent to

$$
H_0\colon E(u^2 \mid x_1, x_2, \ldots, x_k) = E(u^2) = \sigma^2.
$$

This shows that, in order to test for violation of the homoskedasticity assumption, we want to test
whether $u^2$ is related (in expected value) to one or more of the explanatory variables. If $H_0$ is false, the
expected value of $u^2$, given the independent variables, can be virtually any function of the $x_j$. A simple
approach is to assume a linear function:


$$
u^2 = \delta_0 + \delta_1 x_1 + \delta_2 x_2 + \cdots + \delta_k x_k + v, \tag{8.12}
$$

where v is an error term with mean zero given the $x_j$. Pay close attention to the dependent variable in
this equation: it is the square of the error in the original regression equation, (8.10). The null
hypothesis of homoskedasticity is

$$
H_0\colon \delta_1 = \delta_2 = \cdots = \delta_k = 0. \tag{8.13}
$$


Under the null hypothesis, it is often reasonable to assume that the error in (8.12), v, is independent
of $x_1, x_2, \ldots, x_k$. Then, we know from Section 5-2 that either the F or LM statistics for the overall
significance of the independent variables in explaining $u^2$ can be used to test (8.13). Both statistics would
have asymptotic justification, even though $u^2$ cannot be normally distributed. (For example, if u is
normally distributed, then $u^2/\sigma^2$ is distributed as $\chi_1^2$.) If we could observe the $u^2$ in the sample, then
we could easily compute this statistic by running the OLS regression of $u^2$ on $x_1, x_2, \ldots, x_k$ using all
n observations.
As we have emphasized before, we never know the actual errors in the population model, but we
do have estimates of them: the OLS residual, $\hat{u}_i$, is an estimate of the error $u_i$ for observation i. Thus,
we can estimate the equation

$$
\hat{u}^2 = \delta_0 + \delta_1 x_1 + \delta_2 x_2 + \cdots + \delta_k x_k + \text{error} \tag{8.14}
$$

and compute the F or LM statistics for the joint significance of $x_1, \ldots, x_k$. It turns out that using the
OLS residuals in place of the errors does not affect the large sample distribution of the F or LM
statistics, although showing this is pretty complicated.

The F and LM statistics both depend on the R-squared from regression (8.14); call this $R_{\hat{u}^2}^2$ to
distinguish it from the R-squared in estimating equation (8.10). Then, the F statistic is

$$
F = \frac{R_{\hat{u}^2}^2 / k}{(1 - R_{\hat{u}^2}^2)/(n - k - 1)}, \tag{8.15}
$$

where k is the number of regressors in (8.14); this is the same as the number of independent variables in
(8.10). Computing (8.15) by hand is rarely necessary, because most regression packages automatically
compute the F statistic for overall significance of a regression. This F statistic has (approximately)
an $F_{k,\,n-k-1}$ distribution under the null hypothesis of homoskedasticity.

The LM statistic for heteroskedasticity is just the sample size times the R-squared from (8.14): 


$$
LM = n \cdot R_{\hat{u}^2}^2. \tag{8.16}
$$

Under the null hypothesis, LM is distributed asymptotically as $\chi_k^2$. This is also very easy to obtain
after running regression (8.14).

The LM version of the test is typically called the Breusch-Pagan test for heteroskedasticity 
(BP test). Breusch and Pagan (1979) suggested a different form of the test that assumes the errors are 
normally distributed. Koenker (1981) suggested the form of the LM statistic in (8.16), and it is gener- 
ally preferred due to its greater applicability. 

We summarize the steps for testing for heteroskedasticity using the BP test: 


The Breusch-Pagan Test for Heteroskedasticity: 


1. Estimate the model (8.10) by OLS, as usual. Obtain the squared OLS residuals, $\hat{u}^2$ (one for each
observation).

2. Run the regression in (8.14). Keep the R-squared from this regression, $R_{\hat{u}^2}^2$.

3. Form either the F statistic or the LM statistic and compute the p-value (using the $F_{k,\,n-k-1}$
distribution in the former case and the $\chi_k^2$ distribution in the latter case). If the p-value is
sufficiently small, that is, below the chosen significance level, then we reject the null hypothesis of
homoskedasticity.
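The three steps can be sketched on simulated data as follows. The data-generating process, the helper names `ols` and `r_squared`, and the choice of k = 2 regressors are all assumptions made for illustration; the p-value line uses the closed form exp(-LM/2), which is valid only because k = 2 here.

```python
import math
import random

def ols(y, X):
    """OLS of y on the columns of X (a list of rows); returns (coefficients, residuals)."""
    k = len(X[0])
    # Augmented normal equations [X'X | X'y], solved by Gauss-Jordan elimination.
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))] for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))  # partial pivoting
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    b = [A[i][k] / A[i][i] for i in range(k)]
    resid = [yi - sum(bj * xj for bj, xj in zip(b, row)) for row, yi in zip(X, y)]
    return b, resid

def r_squared(y, resid):
    """Usual (centered) R-squared from the dependent variable and the residuals."""
    ybar = sum(y) / len(y)
    sst = sum((yi - ybar) ** 2 for yi in y)
    return 1.0 - sum(e ** 2 for e in resid) / sst

random.seed(1)
n, k = 400, 2
X = [[1.0, random.uniform(1, 5), random.gauss(0, 1)] for _ in range(n)]
# Simulated model (assumption): the error standard deviation is proportional to x1.
y = [2 + 0.5 * row[1] - row[2] + random.gauss(0, 1) * row[1] for row in X]

_, u_hat = ols(y, X)                 # step 1: OLS residuals from the original model
u2 = [e ** 2 for e in u_hat]
_, v_hat = ols(u2, X)                # step 2: regress squared residuals on x1, x2
r2_u2 = r_squared(u2, v_hat)
f_stat = (r2_u2 / k) / ((1.0 - r2_u2) / (n - k - 1))   # equation (8.15)
lm = n * r2_u2                                          # equation (8.16)
p_lm = math.exp(-lm / 2.0)           # chi-square upper tail, valid for k = 2 only
print(round(r2_u2, 4), round(f_stat, 2), round(lm, 2), p_lm)
```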


If the BP test results in a small enough p-value, some corrective measure should be taken. One 
possibility is to just use the heteroskedasticity-robust standard errors and test statistics discussed in 
the previous section. Another possibility is discussed in Section 8-4. 


EXAMPLE 8.4 Heteroskedasticity in Housing Price Equations 


We use the data in HPRICE1 to test for heteroskedasticity in a simple housing price equation. The 
estimated equation using the levels of all variables is 
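$$
\widehat{\mathit{price}} = \underset{(29.48)}{-21.77} + \underset{(.00064)}{.00207}\,\mathit{lotsize} + \underset{(.013)}{.123}\,\mathit{sqrft} + \underset{(9.01)}{13.85}\,\mathit{bdrms} \tag{8.17}
$$
$n = 88$, $R^2 = .672$.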
This equation tells us nothing about whether the error in the population model is heteroskedastic.
We need to regress the squared OLS residuals on the independent variables. The R-squared from the
regression of $\hat{u}^2$ on lotsize, sqrft, and bdrms is $R_{\hat{u}^2}^2 = .1601$. With n = 88 and k = 3, this produces an
F statistic for significance of the independent variables of $F = [.1601/(1 - .1601)](84/3) \approx 5.34$.
The associated p-value is .002, which is strong evidence against the null. The LM statistic is
$88(.1601) \approx 14.09$; this gives a p-value $\approx .0028$ (using the $\chi_3^2$ distribution), giving essentially the
same conclusion as the F statistic. This means that the usual standard errors reported in (8.17) are not
reliable.

In Chapter 6, we mentioned that one benefit of using the logarithmic functional form for the 
dependent variable is that heteroskedasticity is often reduced. In the current application, let us put 
price, lotsize, and sqrft in logarithmic form, so that the elasticities of price, with respect to lotsize and 
sqrft, are constant. The estimated equation is 


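$$
\widehat{\log(\mathit{price})} = \underset{(.65)}{-1.30} + \underset{(.038)}{.168}\,\log(\mathit{lotsize}) + \underset{(.093)}{.700}\,\log(\mathit{sqrft}) + \underset{(.028)}{.037}\,\mathit{bdrms} \tag{8.18}
$$
$n = 88$, $R^2 = .643$.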


Regressing the squared OLS residuals from this regression on log(lotsize), log(sqrft), and bdrms gives
$R_{\hat{u}^2}^2 = .0480$. Thus, F = 1.41 (p-value = .245), and LM = 4.22 (p-value = .239). Therefore, we
fail to reject the null hypothesis of homoskedasticity in the model with the logarithmic functional 
forms. The occurrence of less heteroskedasticity with the dependent variable in logarithmic form has 
been noticed in many empirical applications. 
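The F and LM values in this example follow directly from the reported auxiliary R-squareds, and the LM p-values can be checked with the closed-form upper tail of a chi-square distribution with three degrees of freedom. The function names are hypothetical. Because the inputs are rounded R-squareds, the last digit of a p-value may differ slightly from the figures in the text, and the F p-values are not reproduced since the F distribution has no comparably simple formula.

```python
import math

def bp_stats(r2, n, k):
    """F and LM forms of the Breusch-Pagan statistic from the auxiliary R-squared."""
    f_stat = (r2 / k) / ((1.0 - r2) / (n - k - 1))
    lm = n * r2
    return f_stat, lm

def chi2_sf_3df(x):
    """P(X > x) for a chi-square random variable with 3 df (closed form)."""
    return math.erfc(math.sqrt(x / 2.0)) + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0)

f_level, lm_level = bp_stats(0.1601, n=88, k=3)   # levels model (8.17)
f_log, lm_log = bp_stats(0.0480, n=88, k=3)       # log model (8.18)
print(round(f_level, 2), round(lm_level, 2), round(chi2_sf_3df(lm_level), 4))
print(round(f_log, 2), round(lm_log, 2), round(chi2_sf_3df(lm_log), 3))
```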


If we suspect that heteroskedasticity depends only upon certain independent variables, we can
easily modify the Breusch-Pagan test: we simply regress $\hat{u}^2$ on whatever independent variables we
choose and carry out the appropriate F or LM test. Remember that the appropriate degrees of freedom
depends upon the number of independent variables in the regression with $\hat{u}^2$ as the dependent variable;
the number of independent variables showing up in equation (8.10) is irrelevant.

If the squared residuals are regressed on only a single independent variable, the test for
heteroskedasticity is just the usual t statistic on the variable. A significant t statistic suggests that
heteroskedasticity is a problem.

GOING FURTHER 8.2
Consider wage equation (7.11), where you think that the conditional variance of log(wage) does not
depend on educ, exper, or tenure. However, you are worried that the variance of log(wage) differs
across the four demographic groups of married males, married females, single males, and single
females. What regression would you run to test for heteroskedasticity? What are the degrees of
freedom in the F test?


8-3a The White Test for Heteroskedasticity 


In Chapter 5, we showed that the usual OLS standard errors and test statistics are asymptotically
valid, provided all of the Gauss-Markov assumptions hold. It turns out that the homoskedasticity
assumption, $\operatorname{Var}(u \mid x_1, \ldots, x_k) = \sigma^2$, can be replaced with the weaker assumption that the squared
error, $u^2$, is uncorrelated with all the independent variables ($x_j$), the squares of the independent
variables ($x_j^2$), and all the cross products ($x_j x_h$ for $j \neq h$). This observation motivated White (1980) to
propose a test for heteroskedasticity that adds the squares and cross products of all the independent
variables to equation (8.14). The test is explicitly intended to test for forms of heteroskedasticity that
invalidate the usual OLS standard errors and test statistics.
When the model contains k = 3 independent variables, the White test is based on an
estimation of

$$
\begin{aligned}
\hat{u}^2 = {} & \delta_0 + \delta_1 x_1 + \delta_2 x_2 + \delta_3 x_3 + \delta_4 x_1^2 + \delta_5 x_2^2 + \delta_6 x_3^2 \\
& + \delta_7 x_1 x_2 + \delta_8 x_1 x_3 + \delta_9 x_2 x_3 + \text{error}.
\end{aligned}
\tag{8.19}
$$

Compared with the Breusch-Pagan test, this equation has six more regressors. The White test for
heteroskedasticity is the LM statistic for testing that all of the $\delta_j$ in equation (8.19) are zero, except
for the intercept. Thus, nine restrictions are being tested in this case. We can also use an F test of this
hypothesis; both tests have asymptotic justification.

With only three independent variables in the original model, equation (8.19) has nine indepen- 
dent variables. With six independent variables in the original model, the White regression would gen- 
erally involve 27 regressors (unless some are redundant). This abundance of regressors is a weakness 
in the pure form of the White test: it uses many degrees of freedom for models with just a moderate 
number of independent variables. 
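The regressor counts quoted above come from adding the k levels, the k squares, and the k(k - 1)/2 distinct cross products. A small sketch, with a hypothetical function name and assuming none of the terms is redundant:

```python
def white_regressors(k):
    """Levels, squares, and distinct cross products for k independent variables."""
    return k + k + k * (k - 1) // 2

print(white_regressors(3), white_regressors(6))  # 9 27
```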

It is possible to obtain a test that is easier to implement than the White test and more conserving 
on degrees of freedom. To create the test, recall that the difference between the White and Breusch- 
Pagan tests is that the former includes the squares and cross products of the independent variables. 
We can preserve the spirit of the White test while conserving on degrees of freedom by using the OLS 
fitted values in a test for heteroskedasticity. Remember that the fitted values are defined, for each 
observation i, by 


$$
\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \hat{\beta}_2 x_{i2} + \cdots + \hat{\beta}_k x_{ik}.
$$


These are just linear functions of the independent variables. If we square the fitted values, we get a 
particular function of all the squares and cross products of the independent variables. This suggests 
testing for heteroskedasticity by estimating the equation 


$$
\hat{u}^2 = \delta_0 + \delta_1 \hat{y} + \delta_2 \hat{y}^2 + \text{error}, \tag{8.20}
$$

where $\hat{y}$ stands for the fitted values. It is important not to confuse $\hat{y}$ and y in this equation. We use the
fitted values because they are functions of the independent variables (and the estimated parameters);
using y in (8.20) does not produce a valid test for heteroskedasticity.
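To see concretely why the squared fitted values are a function of the squares and cross products, consider the case of k = 2 independent variables, chosen here only for illustration:

```latex
\[
\hat{y}^2 = (\hat{\beta}_0 + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2)^2
  = \hat{\beta}_0^2 + 2\hat{\beta}_0\hat{\beta}_1 x_1 + 2\hat{\beta}_0\hat{\beta}_2 x_2
    + \hat{\beta}_1^2 x_1^2 + \hat{\beta}_2^2 x_2^2 + 2\hat{\beta}_1\hat{\beta}_2 x_1 x_2 .
\]
```

Every level, square, and cross product appears, but with coefficients tied to the OLS estimates, which is why the regression on the fitted values and their squares uses only two degrees of freedom.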

We can use the F or LM statistic for the null hypothesis $H_0\colon \delta_1 = 0,\ \delta_2 = 0$ in equation (8.20).
This results in two restrictions in testing the null of homoskedasticity, regardless of the number of 
independent variables in the original model. Conserving on degrees of freedom in this way is often a 
good idea, and it also makes the test easy to implement. 

Because $\hat{y}$ is an estimate of the expected value of y, given the $x_j$, using (8.20) to test for
heteroskedasticity is useful in cases where the variance is thought to change with the level of the expected
value, E(y|x). The test from (8.20) can be viewed as a special case of the White test, as equation 
(8.20) can be shown to impose restrictions on the parameters in equation (8.19). 


A Special Case of the White Test for Heteroskedasticity: 
1. Estimate the model (8.10) by OLS, as usual. Obtain the OLS residuals $\hat{u}$ and the fitted values $\hat{y}$.
Compute the squared OLS residuals $\hat{u}^2$ and the squared fitted values $\hat{y}^2$.

2. Run the regression in equation (8.20). Keep the R-squared from this regression, $R_{\hat{u}^2}^2$.

3. Form either the F or LM statistic and compute the p-value (using the $F_{2,\,n-3}$ distribution in the
former case and the $\chi_2^2$ distribution in the latter case).
EXAMPLE 8.5 Special Form of the White Test in the Log Housing Price Equation

We apply the special case of the White test to equation (8.18), where we use the LM form of the
statistic. The important thing to remember is that the chi-square distribution always has two df. The
regression of $\hat{u}^2$ on $\widehat{\mathit{lprice}}$, $(\widehat{\mathit{lprice}})^2$, where $\widehat{\mathit{lprice}}$ denotes the fitted values from (8.18), produces
$R_{\hat{u}^2}^2 = .0392$; thus, $LM = 88(.0392) \approx 3.45$, and the p-value = .178. This is stronger evidence of
heteroskedasticity than is provided by the Breusch-Pagan test, but we still fail to reject
homoskedasticity at even the 15% level.
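The LM statistic and its p-value in this example can be verified directly, again using the fact that the chi-square upper tail with two df equals exp(-x/2). The variable names are illustrative.

```python
import math

n, r2_u2 = 88, 0.0392          # sample size and auxiliary R-squared from the example
lm = n * r2_u2
p_value = math.exp(-lm / 2.0)  # chi-square upper tail with 2 df
print(round(lm, 2), round(p_value, 3))  # 3.45 0.178
```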


Before leaving this section, we should discuss one important caveat. We have interpreted a rejection 
using one of the heteroskedasticity tests as evidence of heteroskedasticity. This is appropriate provided we 
maintain Assumptions MLR.1 through MLR.4. But if MLR.4 is violated (in particular, if the functional
form of $E(y \mid \mathbf{x})$ is misspecified), then a test for heteroskedasticity can reject $H_0$, even if $\operatorname{Var}(y \mid \mathbf{x})$ is
constant. For example, if we omit one or more quadratic terms in a regression model or use the level model
when we should use the log, a test for heteroskedasticity can be significant. This has led some economists 
to view tests for heteroskedasticity as general misspecification tests. However, there are better, more direct 
tests for functional form misspecification, and we will cover some of them in Section 9-1. It is better to 
use explicit tests for functional form first, as functional form misspecification is more important than het- 
eroskedasticity. Then, once we are satisfied with the functional form, we can test for heteroskedasticity. 


8-4 Weighted Least Squares Estimation 


If heteroskedasticity is detected using one of the tests in Section 8-3, we know from Section 8-2 that 
one possible response is to use heteroskedasticity-robust statistics after estimation by OLS. Before 
the development of heteroskedasticity-robust statistics, the response to a finding of heteroskedasticity 
was to specify its form and use a weighted least squares method, which we develop in this section. 
As we will argue, if we have correctly specified the form of the variance (as a function of explana- 
tory variables), then weighted least squares (WLS) is more efficient than OLS, and WLS leads to new 
t and F statistics that have t and F distributions. We will also discuss the implications of using the
wrong form of the variance in the WLS procedure. 


8-4a The Heteroskedasticity Is Known up to a Multiplicative Constant 
Let x denote all the explanatory variables in equation (8.10) and assume that 
$$
\operatorname{Var}(u \mid \mathbf{x}) = \sigma^2 h(\mathbf{x}), \tag{8.21}
$$

where $h(\mathbf{x})$ is some function of the explanatory variables that determines the heteroskedasticity.
Because variances must be positive, $h(\mathbf{x}) > 0$ for all possible values of the independent variables. For
now, we assume that the function $h(\mathbf{x})$ is known. The population parameter $\sigma^2$ is unknown, but we
will be able to estimate it from a data sample.

For a random drawing from the population, we can write $\sigma_i^2 = \operatorname{Var}(u_i \mid \mathbf{x}_i) = \sigma^2 h(\mathbf{x}_i) = \sigma^2 h_i$,
where we again use the notation $\mathbf{x}_i$ to denote all independent variables for observation i, and $h_i$
changes with each observation because the independent variables change across observations. For
example, consider the simple savings function

$$
\mathit{sav}_i = \beta_0 + \beta_1 \mathit{inc}_i + u_i \tag{8.22}
$$
$$
\operatorname{Var}(u_i \mid \mathit{inc}_i) = \sigma^2 \mathit{inc}_i. \tag{8.23}
$$




Here, $h(\mathbf{x}) = h(\mathit{inc}) = \mathit{inc}$: the variance of the error is proportional to the level of income. This
means that, as income increases, the variability in savings increases. (If $\beta_1 > 0$, the expected value of
savings also increases with income.) Because inc is always positive, the variance in equation (8.23) is
always guaranteed to be positive. The standard deviation of $u_i$, conditional on $\mathit{inc}_i$, is $\sigma\sqrt{\mathit{inc}_i}$.

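How can we use the information in equation (8.21) to estimate the $\beta_j$? Essentially, we take the
original equation,

$$
y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + u_i, \tag{8.24}
$$
which contains heteroskedastic errors, and transform it into an equation that has homoskedastic errors
(and satisfies the other Gauss-Markov assumptions). Because $h_i$ is just a function of $\mathbf{x}_i$, $u_i/\sqrt{h_i}$ has a
zero expected value conditional on $\mathbf{x}_i$. Further, because $\operatorname{Var}(u_i \mid \mathbf{x}_i) = E(u_i^2 \mid \mathbf{x}_i) = \sigma^2 h_i$, the variance of
$u_i/\sqrt{h_i}$ (conditional on $\mathbf{x}_i$) is $\sigma^2$:

$$
E\big[(u_i/\sqrt{h_i})^2\big] = E(u_i^2)/h_i = (\sigma^2 h_i)/h_i = \sigma^2,
$$

where we have suppressed the conditioning on $\mathbf{x}_i$ for simplicity. We can divide equation (8.24) by $\sqrt{h_i}$
to get

$$
\begin{aligned}
y_i/\sqrt{h_i} = {} & \beta_0/\sqrt{h_i} + \beta_1 (x_{i1}/\sqrt{h_i}) + \beta_2 (x_{i2}/\sqrt{h_i}) + \cdots \\
& + \beta_k (x_{ik}/\sqrt{h_i}) + (u_i/\sqrt{h_i})
\end{aligned}
\tag{8.25}
$$
or
$$
y_i^* = \beta_0 x_{i0}^* + \beta_1 x_{i1}^* + \cdots + \beta_k x_{ik}^* + u_i^*, \tag{8.26}
$$

where $x_{i0}^* = 1/\sqrt{h_i}$ and the other starred variables denote the corresponding original variables divided
by $\sqrt{h_i}$.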

Equation (8.26) looks a little peculiar, but the important thing to remember is that we derived it so we
could obtain estimators of the $\beta_j$ that have better efficiency properties than OLS. The intercept $\beta_0$ in the
original equation (8.24) is now multiplying the variable $x_{i0}^* = 1/\sqrt{h_i}$. Each slope parameter $\beta_j$ multiplies
a new variable that rarely has a useful interpretation. This should not cause problems if we recall that, for
interpreting the parameters and the model, we always want to return to the original equation (8.24).

In the preceding savings example, the transformed equation looks like 


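$$
\mathit{sav}_i/\sqrt{\mathit{inc}_i} = \beta_0 (1/\sqrt{\mathit{inc}_i}) + \beta_1 \sqrt{\mathit{inc}_i} + u_i^*,
$$

where we use the fact that $\mathit{inc}_i/\sqrt{\mathit{inc}_i} = \sqrt{\mathit{inc}_i}$. Nevertheless, $\beta_1$ is the marginal propensity to save out
of income, an interpretation we obtain from equation (8.22).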

Equation (8.26) is linear in its parameters (so it satisfies MLR.1), and the random sampling
assumption has not changed. Further, $u_i^*$ has a zero mean and a constant variance ($\sigma^2$), conditional on $\mathbf{x}_i$.
This means that if the original equation satisfies the first four Gauss-Markov assumptions, then the
transformed equation (8.26) satisfies all five Gauss-Markov assumptions. Also, if $u_i$ has a normal
distribution, then $u_i^*$ has a normal distribution with variance $\sigma^2$. Therefore, the transformed equation
satisfies the classical linear model assumptions (MLR.1 through MLR.6) if the original model does so
except for the homoskedasticity assumption.

Because we know that OLS has appealing properties (is BLUE, for example) under the Gauss- 
Markov assumptions, the discussion in the previous paragraph suggests estimating the parameters in 
equation (8.26) by ordinary least squares. These estimators, $\beta_0^*, \beta_1^*, \ldots, \beta_k^*$, will be different from the
OLS estimators in the original equation. The $\beta_j^*$ are examples of generalized least squares (GLS)
estimators. In this case, the GLS estimators are used to account for heteroskedasticity in the errors. 
We will encounter other GLS estimators in Chapter 12. 

Because equation (8.26) satisfies all of the ideal assumptions, standard errors, t statistics, and
F statistics can all be obtained from regressions using the transformed variables. The sum of squared
residuals from (8.26) divided by the degrees of freedom is an unbiased estimator of $\sigma^2$. Further, the
GLS estimators, because they are the best linear unbiased estimators of the $\beta_j$, are necessarily more
efficient than the OLS estimators $\hat{\beta}_j$ obtained from the untransformed equation. Essentially, after we
have transformed the variables, we simply use standard OLS analysis. But we must remember to
interpret the estimates in light of the original equation.

The GLS estimators for correcting heteroskedasticity are called weighted least squares (WLS)
estimators. This name comes from the fact that the $\beta_j^*$ minimize the weighted sum of squared
residuals, where each squared residual is weighted by $1/h_i$. The idea is that less weight is given to
observations with a higher error variance; OLS gives each observation the same weight because it is best
when the error variance is identical for all partitions of the population. Mathematically, the WLS
estimators are the values of the $b_j$ that make

$$
\sum_{i=1}^{n} (y_i - b_0 - b_1 x_{i1} - b_2 x_{i2} - \cdots - b_k x_{ik})^2 / h_i \tag{8.27}
$$

as small as possible. Bringing the square root of $1/h_i$ inside the squared residual shows that the weighted
sum of squared residuals is identical to the sum of squared residuals in the transformed variables:

$$
\sum_{i=1}^{n} (y_i^* - b_0 x_{i0}^* - b_1 x_{i1}^* - b_2 x_{i2}^* - \cdots - b_k x_{ik}^*)^2 .
$$

Because OLS minimizes the sum of squared residuals (regardless of the definitions of the dependent
variable and independent variable), it follows that the WLS estimators that minimize (8.27) are
simply the OLS estimators from (8.26). Note carefully that the squared residuals in (8.27) are weighted
by $1/h_i$, whereas the transformed variables in (8.26) are weighted by $1/\sqrt{h_i}$.
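The equivalence between minimizing the weighted sum of squared residuals in (8.27) and running OLS on the transformed equation (8.26) can be checked numerically for the savings function with $h_i = inc_i$. The sketch below rests on several assumptions: the data are simulated, the helper `ols` is a hypothetical teaching routine, and the weighted estimates use the standard closed-form expressions for a simple regression with weights $1/h_i$.

```python
import math
import random

def ols(y, X):
    """OLS of y on the columns of X (a list of rows); returns (coefficients, residuals)."""
    k = len(X[0])
    # Augmented normal equations [X'X | X'y], solved by Gauss-Jordan elimination.
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))] for i in range(k)]
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))  # partial pivoting
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                f = A[r][c] / A[c][c]
                A[r] = [a - f * b for a, b in zip(A[r], A[c])]
    b = [A[i][k] / A[i][i] for i in range(k)]
    resid = [yi - sum(bj * xj for bj, xj in zip(b, row)) for row, yi in zip(X, y)]
    return b, resid

random.seed(2)
n = 200
inc = [random.uniform(5, 50) for _ in range(n)]
# Simulated savings (assumption) with Var(u | inc) = sigma^2 * inc, as in (8.23).
sav = [0.5 + 0.1 * x + random.gauss(0, 1) * math.sqrt(x) for x in inc]
h = inc

# (a) OLS on the transformed variables of (8.26); there is no separate intercept column.
y_star = [s / math.sqrt(hi) for s, hi in zip(sav, h)]
X_star = [[1.0 / math.sqrt(hi), x / math.sqrt(hi)] for x, hi in zip(inc, h)]
b_transformed, _ = ols(y_star, X_star)

# (b) Direct minimization of the weighted SSR in (8.27) with weights 1/h_i.
w = [1.0 / hi for hi in h]
sw = sum(w)
xbar = sum(wi * x for wi, x in zip(w, inc)) / sw
ybar = sum(wi * s for wi, s in zip(w, sav)) / sw
b1 = (sum(wi * (x - xbar) * (s - ybar) for wi, x, s in zip(w, inc, sav))
      / sum(wi * (x - xbar) ** 2 for wi, x in zip(w, inc)))
b0 = ybar - b1 * xbar

print(b_transformed, [b0, b1])  # the two pairs agree up to rounding error
```

Both routes return estimates of the intercept and slope of the original savings equation (8.22), which is how they should be interpreted.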

A weighted least squares estimator can be defined for any set of positive weights. Ordinary least 
squares is the special case that gives equal weight to all observations. The efficient procedure, GLS, 
weights each squared residual by the inverse of the conditional variance of $u_i$ given $\mathbf{x}_i$.

Obtaining the transformed variables in equation (8.25) in order to manually perform weighted 
least squares can be tedious, and the chance of making mistakes is nontrivial. Fortunately, most mod- 
ern regression packages have a feature for computing weighted least squares. Typically, along with 
the dependent and independent variables in the original model, we just specify the weighting
function, $1/h_i$, appearing in (8.27). That is, we specify weights proportional to the inverse of the variance.
In addition to making mistakes less likely, this forces us to interpret weighted least squares estimates 
in the original model. In fact, we can write out the estimated equation in the usual way. The estimates 
and standard errors will be different from OLS, but the way we interpret those estimates, standard 
errors, and test statistics is the same. 

Econometrics packages that have a built-in WLS option will report an R-squared (and adjusted 
R-squared) along with WLS estimates and standard errors. Typically, the WLS R-squared is obtained 
from the weighted SSR, obtained from minimizing equation (8.27), and a weighted total sum of 
squares (SST), obtained by using the same weights but setting all of the slope coefficients in equation 
(8.27), $b_1, b_2, \ldots, b_k$, to zero. As a goodness-of-fit measure, this R-squared is not especially useful,
as it effectively measures explained variation in $y_i^*$ rather than $y_i$. Nevertheless, the WLS R-squareds
computed as just described are appropriate for computing F statistics for exclusion restrictions (pro- 
vided we have properly specified the variance function). As in the case of OLS, the SST terms cancel, 
and so we obtain the F statistic based on the weighted SSR. 

The R-squared from running the OLS regression in equation (8.26) is even less useful as a 
goodness-of-fit measure, as the computation of SST would make little sense: one would necessar- 
ily exclude an intercept from the regression, in which case regression packages typically compute 
the SST without properly centering the $y_i^*$. This is another reason for using a WLS option that is
pre-programmed in a regression package because at least the reported R-squared properly compares 
the model with all of the independent variables to a model with only an intercept. Because the SST 
cancels out when testing exclusion restrictions, improperly computing SST does not affect the 
R-squared form of the F statistic. Nevertheless, computing such an R-squared tempts one to think the 
equation fits better than it does. 

EXAMPLE 8.6 Financial Wealth Equation


We now estimate equations that explain net total financial wealth (nettfa, measured in $1,000s) in 
terms of income (inc, also measured in $1,000s) and some other variables, including age, gender, 
and an indicator for whether the person is eligible for a 401(k) pension plan. We use the data on 
single people (fsize = 1) in 401KSUBS. In Computer Exercise C12 in Chapter 6, it was found that 
a specific quadratic function in age, namely $(age - 25)^2$, fit the data just as well as an unrestricted
quadratic. Plus, the restricted form gives a simplified interpretation because the minimum age in the 
sample is 25: nettfa is an increasing function of age after age = 25. 

The results are reported in Table 8.1. Because we suspect heteroskedasticity, we report the 
heteroskedasticity-robust standard errors for OLS. The weighted least squares estimates, and their 
standard errors, are obtained under the assumption $\mathrm{Var}(u|inc) = \sigma^2 inc$.

Without controlling for other factors, another dollar of income is estimated to increase nettfa by 
about 82¢ when OLS is used; the WLS estimate is smaller, about 79¢. The difference is not large; we 
certainly do not expect them to be identical. The WLS coefficient does have a smaller standard error 
than OLS, almost 40% smaller, provided we assume the model $\mathrm{Var}(nettfa|inc) = \sigma^2 inc$ is correct.

Adding the other controls reduced the inc coefficient somewhat, with the OLS estimate still 
larger than the WLS estimate. Again, the WLS estimate of $\beta_{inc}$ is more precise. Age has an increasing
effect starting at age = 25, with the OLS estimate showing a larger effect. The WLS estimate of $\beta_{age}$
is more precise in this case. Gender does not have a statistically significant effect on nettfa, but being 
eligible for a 401(k) plan does: the OLS estimate is that those eligible, holding fixed income, age, and 
gender, have net total financial assets about $6,890 higher. The WLS estimate is substantially below 
the OLS estimate and suggests a misspecification of the functional form in the mean equation. (One
possibility is to interact e401k and inc; see Computer Exercise C11.)

Using WLS, the F statistic for joint significance of $(age - 25)^2$, male, and e401k is about 30.8 if we
use the R-squareds reported in Table 8.1. With 3 and 2,012 degrees of freedom, the p-value is zero to more
than 15 decimal places; of course, this is not surprising given the very large t statistics for the age and
401(k) variables.

GOING FURTHER 8.3
Using the OLS residuals obtained from the OLS regression reported in column (1) of Table 8.1, the
regression of $\hat{u}^2$ on inc yields a t statistic of 2.96. Does it appear we should worry about
heteroskedasticity in the financial wealth equation?



TABLE 8.1 Dependent Variable: nettfa

Independent       (1)        (2)        (3)        (4)
Variables         OLS        WLS        OLS        WLS
inc               .821       .787       .771       .740
                  (.104)     (.063)     (.100)     (.064)
(age - 25)^2      --         --         .0251      .0175
                                        (.0043)    (.0019)
male              --         --         2.48       1.84
                                        (2.06)     (1.56)
e401k             --         --         6.89       5.19
                                        (2.29)     (1.70)
intercept         -10.57     -9.58      -20.98     -16.70
                  (2.53)     (1.65)     (3.50)     (1.96)
Observations      2,017      2,017      2,017      2,017
R-squared         .0827      .0709      .1279      .1115

Assuming that the error in the financial wealth equation has a variance proportional to
income is essentially arbitrary. In fact, in most cases, our choice of weights in WLS has a degree of 
arbitrariness. However, there is one case in which the weights needed for WLS arise naturally from 
an underlying econometric model. This happens when, instead of using individual-level data, we only 
have averages of data across some group or geographic region. For example, suppose we are interested
in determining how the amount a worker contributes to his or her 401(k)
pension plan depends on the plan's generosity. Let i denote a particular firm and let e denote an
employee within the firm. A simple model is 


$contrib_{i,e} = \beta_0 + \beta_1 earns_{i,e} + \beta_2 age_{i,e} + \beta_3 mrate_i + u_{i,e},$ [8.28]


where $contrib_{i,e}$ is the annual contribution by employee e who works for firm i, $earns_{i,e}$ is annual earnings
for this person, and $age_{i,e}$ is the person's age. The variable $mrate_i$ is the amount the firm puts into
an employee’s account for every dollar the employee contributes. 

If (8.28) satisfies the Gauss-Markov assumptions, then we could estimate it, given a sample of 
individuals across various employers. Suppose, however, that we only have average values of contri- 
butions, earnings, and age by employer. In other words, individual-level data are not available. Thus, 
let $\overline{contrib}_i$ denote average contribution for people at firm i, and similarly for $\overline{earns}_i$ and $\overline{age}_i$. Let $m_i$
denote the number of employees at firm i; we assume that this is a known quantity. Then, if we average
equation (8.28) across all employees at firm i, we obtain the firm-level equation


$\overline{contrib}_i = \beta_0 + \beta_1 \overline{earns}_i + \beta_2 \overline{age}_i + \beta_3 mrate_i + \bar{u}_i,$ [8.29]

where $\bar{u}_i = m_i^{-1} \sum_{e=1}^{m_i} u_{i,e}$ is the average error across all employees in firm i. If we have n firms in our
sample, then (8.29) is just a standard multiple linear regression model that can be estimated by OLS.
The estimators are unbiased if the original model (8.28) satisfies the Gauss-Markov assumptions and
the individual errors $u_{i,e}$ are independent of the firm's size, $m_i$ [because then the expected value of $\bar{u}_i$,
given the explanatory variables in (8.29), is zero].

If the individual-level equation (8.28) satisfies the homoskedasticity assumption, and the errors 
within firm i are uncorrelated across employees, then we can show that the firm-level equation 
(8.29) has a particular kind of heteroskedasticity. Specifically, if $\mathrm{Var}(u_{i,e}) = \sigma^2$ for all i and e, and
$\mathrm{Cov}(u_{i,e}, u_{i,g}) = 0$ for every pair of employees $e \neq g$ within firm i, then $\mathrm{Var}(\bar{u}_i) = \sigma^2/m_i$; this is just
the usual formula for the variance of an average of uncorrelated random variables with common variance.
In other words, the variance of the error term $\bar{u}_i$ decreases with firm size. In this case, $h_i = 1/m_i$,
and so the most efficient procedure is weighted least squares, with weights equal to the number of
employees at the firm ($1/h_i = m_i$). This ensures that larger firms receive more weight. This gives us
an efficient way of estimating the parameters in the individual-level model when we only have aver- 
ages at the firm level. 
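
For reference, the variance of the firm-level error can be written out step by step. This is only the standard calculation for an average of uncorrelated, equal-variance terms, using the same assumptions stated above (common variance $\sigma^2$ and zero covariance across employees within a firm):

```latex
\begin{aligned}
\mathrm{Var}(\bar{u}_i)
  &= \mathrm{Var}\Big( m_i^{-1} \sum_{e=1}^{m_i} u_{i,e} \Big)
   = m_i^{-2} \sum_{e=1}^{m_i} \mathrm{Var}(u_{i,e})
     \quad \text{(all covariances are zero)} \\
  &= m_i^{-2} \, (m_i \sigma^2)
   = \sigma^2 / m_i .
\end{aligned}
```

Matching this to $\mathrm{Var}(\bar{u}_i) = \sigma^2 h_i$ gives $h_i = 1/m_i$, so the WLS weight $1/h_i$ is the firm size $m_i$.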

A similar weighting arises when we are using per capita data at the city, county, state, or coun- 
try level. If the individual-level equation satisfies the Gauss-Markov assumptions, then the error in 
the per capita equation has a variance proportional to one over the size of the population. Therefore, 
weighted least squares with weights equal to the population is appropriate. For example, suppose we 
have city-level data on per capita beer consumption (in ounces), the percentage of people in the popu- 
lation over 21 years old, average adult education levels, average income levels, and the city price of 
beer. Then, the city-level model 


$beerpc = \beta_0 + \beta_1 perc21 + \beta_2 avgeduc + \beta_3 incpc + \beta_4 price + u$


can be estimated by weighted least squares, with the weights being the city population. 

The advantage of weighting by firm size, city population, and so on relies on the underlying 
individual equation being homoskedastic. If heteroskedasticity exists at the individual level, then the 
proper weighting depends on the form of heteroskedasticity. Further, if there is correlation across 
errors within a group (say, firm), then $\mathrm{Var}(\bar{u}_i) \neq \sigma^2/m_i$; see Problem 7. Uncertainty about the form of
$\mathrm{Var}(\bar{u}_i)$ in equations such as (8.29) is why more and more researchers simply use OLS and compute
robust standard errors and test statistics when estimating models using per capita data. An alternative
is to weight by group size but to report the heteroskedasticity-robust statistics in the WLS estimation.
This ensures that, while the estimation is efficient if the individual-level model satisfies the
Gauss-Markov assumptions, heteroskedasticity at the individual level or within-group correlation is
accounted for through robust inference.


8-4b The Heteroskedasticity Function Must Be Estimated: Feasible GLS


In the previous subsection, we saw some examples of where the heteroskedasticity is known up to a
multiplicative constant. In most cases, the exact form of heteroskedasticity is not obvious. In other words, it
is difficult to find the function $h(x_i)$ of the previous subsection. Nevertheless, in many cases we can model
the function h and use the data to estimate the unknown parameters in this model. This results in an estimate
of each $h_i$, denoted as $\hat{h}_i$. Using $\hat{h}_i$ instead of $h_i$ in the GLS transformation yields an estimator called
the feasible GLS (FGLS) estimator. Feasible GLS is sometimes called estimated GLS, or EGLS.

There are many ways to model heteroskedasticity, but we will study one particular, fairly flexible 
approach. Assume that 


$\mathrm{Var}(u|x) = \sigma^2 \exp(\delta_0 + \delta_1 x_1 + \delta_2 x_2 + \cdots + \delta_k x_k),$ [8.30]


where $x_1, x_2, \ldots, x_k$ are the independent variables appearing in the regression model [see equation
(8.1)], and the $\delta_j$ are unknown parameters. Other functions of the $x_j$ can appear, but we will focus primarily
on (8.30). In the notation of the previous subsection, $h(x) = \exp(\delta_0 + \delta_1 x_1 + \delta_2 x_2 + \cdots + \delta_k x_k)$.

You may wonder why we have used the exponential function in (8.30). After all, when testing 
for heteroskedasticity using the Breusch-Pagan test, we assumed that heteroskedasticity was a linear 
function of the $x_j$. Linear alternatives such as (8.12) are fine when testing for heteroskedasticity, but
they can be problematic when correcting for heteroskedasticity using weighted least squares. We have 
encountered the reason for this problem before: linear models do not ensure that predicted values are 
positive, and our estimated variances must be positive in order to perform WLS. 

If the parameters $\delta_j$ were known, then we would just apply WLS, as in the previous subsection.
This is not very realistic. It is better to use the data to estimate these parameters, and then to use these
estimates to construct weights. How can we estimate the $\delta_j$? Essentially, we will transform equation
(8.30) into a linear form that, with slight modification, can be estimated by OLS.

Under assumption (8.30), we can write 


$u^2 = \sigma^2 \exp(\delta_0 + \delta_1 x_1 + \delta_2 x_2 + \cdots + \delta_k x_k)\, v,$


where v has a mean equal to unity, conditional on $x = (x_1, x_2, \ldots, x_k)$. If we assume that v is actually
independent of x, we can write

$\log(u^2) = \alpha_0 + \delta_1 x_1 + \delta_2 x_2 + \cdots + \delta_k x_k + e,$ [8.31]


where e has a zero mean and is independent of x; the intercept in this equation is different from $\delta_0$, but this
is not important in implementing WLS. The dependent variable is the log of the squared error. Because
(8.31) satisfies the Gauss-Markov assumptions, we can get unbiased estimators of the $\delta_j$ by using OLS.

As usual, we must replace the unobserved u with the OLS residuals. Therefore, we run the 
regression of 


$\log(\hat{u}^2)$ on $x_1, x_2, \ldots, x_k$. [8.32]

Actually, what we need from this regression are the fitted values; call these $\hat{g}_i$. Then, the estimates of
$h_i$ are simply

$\hat{h}_i = \exp(\hat{g}_i).$ [8.33]

We now use WLS with weights $1/\hat{h}_i$ in place of $1/h_i$ in equation (8.27). We summarize the steps.

A Feasible GLS Procedure to Correct for Heteroskedasticity:
1. Run the regression of y on $x_1, x_2, \ldots, x_k$ and obtain the residuals, $\hat{u}$.
2. Create $\log(\hat{u}^2)$ by first squaring the OLS residuals and then taking the natural log.
3. Run the regression in equation (8.32) and obtain the fitted values, $\hat{g}$.
4. Exponentiate the fitted values from (8.32): $\hat{h} = \exp(\hat{g})$.
5. Estimate the equation

$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + u$

by WLS, using weights $1/\hat{h}$. In other words, we replace $h_i$ with $\hat{h}_i$ in equation (8.27). Remember,
the squared residual for observation i gets weighted by $1/\hat{h}_i$. If instead we first transform all variables
and run OLS, each variable gets multiplied by $1/\sqrt{\hat{h}_i}$, including the intercept.
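
The five steps can be sketched in a few lines of code. This is a minimal illustration on simulated data, assuming NumPy is available; the function names `ols` and `fgls` and the data-generating values are hypothetical and chosen only so the example runs.

```python
import numpy as np

def ols(X, y):
    """OLS coefficients; X must already contain a column of ones."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def fgls(X, y):
    """Feasible GLS assuming Var(u|x) = sigma^2 * exp(d0 + d1*x1 + ... + dk*xk)."""
    # Step 1: OLS of y on the regressors, keep the residuals
    u_hat = y - X @ ols(X, y)
    # Step 2: log of the squared OLS residuals
    log_u2 = np.log(u_hat ** 2)
    # Step 3: regress log(u_hat^2) on the same regressors, keep fitted values
    g_hat = X @ ols(X, log_u2)
    # Step 4: exponentiate, which guarantees positive variance estimates
    h_hat = np.exp(g_hat)
    # Step 5: WLS with weights 1/h_hat, done here by dividing every
    # variable (including the intercept column) by sqrt(h_hat)
    s = np.sqrt(h_hat)
    b_fgls = ols(X / s[:, None], y / s)
    return b_fgls, h_hat

# Simulated data (hypothetical): true slope 0.5, Var(u|x1) = exp(0.5 * x1)
rng = np.random.default_rng(1)
n = 2000
x1 = rng.uniform(0.0, 4.0, size=n)
u = rng.normal(size=n) * np.sqrt(np.exp(0.5 * x1))
y = 1.0 + 0.5 * x1 + u
X = np.column_stack([np.ones(n), x1])

b_fgls, h_hat = fgls(X, y)
```

Because the weights come from an exponential, every $\hat{h}_i$ is strictly positive, which is the practical reason for preferring (8.30) to a linear variance model.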


If we could use $h_i$ rather than $\hat{h}_i$ in the WLS procedure, we know that our estimators would be
unbiased; in fact, they would be the best linear unbiased estimators, assuming that we have properly
modeled the heteroskedasticity. Having to estimate $h_i$ using the same data means that the FGLS estimator
is no longer unbiased (so it cannot be BLUE, either). Nevertheless, the FGLS estimator is consistent
and asymptotically more efficient than OLS. This is difficult to show because of estimation of
the variance parameters. But if we ignore this (as it turns out we may), the proof is similar to showing
that OLS is efficient in the class of estimators in Theorem 5.3. At any rate, for large sample sizes,
FGLS is an attractive alternative to OLS when there is evidence of heteroskedasticity that inflates the
standard errors of the OLS estimates.

We must remember that the FGLS estimators are estimators of the parameters in the usual popu- 
lation model 


$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + u.$


Just as the OLS estimates measure the marginal impact of each $x_j$ on y, so do the FGLS estimates. We
use the FGLS estimates in place of the OLS estimates because the FGLS estimators are more efficient
and have associated test statistics with the usual t and F distributions, at least in large samples. If we
have some doubt about the variance specified in equation (8.30), we can use heteroskedasticity-robust 
standard errors and test statistics in the transformed equation. 

Another useful alternative for estimating $h_i$ is to replace the independent variables in regression
(8.32) with the OLS fitted values and their squares. In other words, obtain the $\hat{g}_i$ as the fitted values
from the regression of

$\log(\hat{u}^2)$ on $\hat{y}, \hat{y}^2$ [8.34]

and then obtain the $\hat{h}_i$ exactly as in equation (8.33). This changes only step (3) in the previous
procedure. 

If we use regression (8.32) to estimate the variance function, you may be wondering if we can 
simply test for heteroskedasticity using this same regression (an F or LM test can be used). In fact, 
Park (1966) suggested this. Unfortunately, when compared with the tests discussed in Section 8-3, 
the Park test has some problems. First, the null hypothesis must be something stronger than homo- 
skedasticity: effectively, u and x must be independent. This is not required in the Breusch-Pagan or 
White tests. Second, using the OLS residuals $\hat{u}$ in place of u in (8.32) can cause the F statistic to
deviate from the F distribution, even in large sample sizes. This is not an issue in the other tests we 
have covered. For these reasons, the Park test is not recommended when testing for heteroskedasticity. 
Regression (8.32) works well for weighted least squares because we only need consistent estimators 
of the $\delta_j$, and regression (8.32) certainly delivers those.

EXAMPLE 8.7 Demand for Cigarettes


We use the data in SMOKE to estimate a demand function for daily cigarette consumption. Because 
most people do not smoke, the dependent variable, cigs, is zero for most observations. A linear model 
is not ideal because it can result in negative predicted values. Nevertheless, we can still learn some- 
thing about the determinants of cigarette smoking by using a linear model. 

The equation estimated by ordinary least squares, with the usual OLS standard errors in parentheses, is

$\widehat{cigs} = \underset{(24.08)}{-3.64} + \underset{(.728)}{.880}\,\log(income) - \underset{(5.773)}{.751}\,\log(cigpric) - \underset{(.167)}{.501}\,educ + \underset{(.160)}{.771}\,age - \underset{(.0017)}{.0090}\,age^2 - \underset{(1.11)}{2.83}\,restaurn$ [8.35]

$n = 807$, $R^2 = .0526$,

where


cigs = number of cigarettes smoked per day. 
income = annual income. 
cigpric = the per-pack price of cigarettes (in cents). 
educ = years of schooling. 
age = age measured in years. 
restaurn = a binary indicator equal to unity if the person resides in a state with restaurant 
smoking restrictions. 


Because we are also going to do weighted least squares, we do not report the heteroskedasticity- 
robust standard errors for OLS. (Incidentally, 13 out of the 807 fitted values are less than zero; this is 
less than 2% of the sample and is not a major cause for concern.) 

Neither income nor cigarette price is statistically significant in (8.35), and their effects are 
not practically large. For example, if income increases by 10%, cigs is predicted to increase by 
(.880/100)(10) = .088, or less than one-tenth of a cigarette per day. The magnitude of the price 
effect is similar. 

Each year of education reduces the average cigarettes smoked per day by one-half of a cigarette, 
and the effect is statistically significant. Cigarette smoking is also related to age, in a quadratic fashion.
Smoking increases with age up until $age = .771/[2(.009)] = 42.83$, and then smoking decreases
with age. Both terms in the quadratic are statistically significant. The presence of a restriction on 
smoking in restaurants decreases cigarette smoking by almost three cigarettes per day, on average. 

Do the errors underlying equation (8.35) contain heteroskedasticity? The Breusch-Pagan regression
of the squared OLS residuals on the independent variables in (8.35) [see equation (8.14)] produces
$R^2_{\hat{u}^2} = .040$. This small R-squared may seem to indicate no heteroskedasticity, but we must
remember to compute either the F or LM statistic. If the sample size is large, a seemingly small $R^2_{\hat{u}^2}$ can
result in a very strong rejection of homoskedasticity. The LM statistic is $LM = 807(.040) = 32.28$,
and this is the outcome of a $\chi^2_6$ random variable. The p-value is less than .000015, which is very
strong evidence of heteroskedasticity.
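
The LM calculation can be reproduced from the reported numbers alone. The sketch below uses only the Python standard library; the helper `chi2_sf_even_df` is a hypothetical name, and it relies on the closed-form upper-tail probability of a chi-square distribution that holds when the degrees of freedom are even (here 6, the number of regressors in (8.35)).

```python
import math

def chi2_sf_even_df(x, df):
    """Upper-tail probability P(X > x) for a chi-square with even df.

    For df = 2m, P(X > x) = exp(-x/2) * sum_{j=0}^{m-1} (x/2)^j / j!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form requires a positive, even df")
    half = x / 2.0
    term, total = 1.0, 1.0
    for j in range(1, df // 2):
        term *= half / j        # builds (x/2)^j / j! recursively
        total += term
    return math.exp(-half) * total

n = 807          # sample size in SMOKE
r2_u2 = 0.040    # R-squared from the Breusch-Pagan regression
k = 6            # number of regressors in (8.35)

lm = n * r2_u2                      # LM = n * R^2
p_value = chi2_sf_even_df(lm, k)    # compare with the chi-square(6) distribution
```

The resulting p-value falls just below .000015, in line with the statement in the text.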

Therefore, we estimate the equation using the feasible GLS procedure based on equation (8.32). 
The weighted least squares estimates are 


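$\widehat{cigs} = \underset{(17.80)}{5.64} + \underset{(.44)}{1.30}\,\log(income) - \underset{(4.46)}{2.94}\,\log(cigpric) - \underset{(.120)}{.463}\,educ + \underset{(.097)}{.482}\,age - \underset{(.0009)}{.0056}\,age^2 - \underset{(.80)}{3.46}\,restaurn$ [8.36]

$n = 807$, $R^2 = .1134$.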




The income effect is now statistically significant and larger in magnitude. The price effect is also 
notably bigger, but it is still statistically insignificant. [One reason for this is that cigpric varies only 
across states in the sample, and so there is much less variation in log(cigpric) than in log(income), 
educ, and age.] 

The estimates on the other variables have, naturally, changed somewhat, but the basic story is still 
the same. Cigarette smoking is negatively related to schooling, has a quadratic relationship with age, 
and is negatively affected by restaurant smoking restrictions. 


We must be a little careful in computing F statistics for testing multiple hypotheses after estima- 
tion by WLS. (This is true whether the sum of squared residuals or R-squared form of the F statistic 
is used.) It is important that the same weights be used to estimate the unrestricted and restricted 
models. We should first estimate the unrestricted model by OLS. Once we have obtained the weights, 
we can use them to estimate the restricted model as well. The F statistic can be computed as usual. 
Fortunately, many regression packages have a simple command for testing joint restrictions after 
WLS estimation, so we need not perform the restricted regression ourselves. 
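
To make the "same weights" requirement concrete, here is a minimal sketch of the SSR form of the F statistic after WLS, assuming NumPy is available. The function names and the simulated data are hypothetical; the key point is that one vector `h` is passed to both the restricted and the unrestricted fit.

```python
import numpy as np

def weighted_ssr(X, y, h):
    """Weighted SSR from WLS with weights 1/h (OLS on variables divided by sqrt(h))."""
    s = np.sqrt(h)
    Xs, ys = X / s[:, None], y / s
    b = np.linalg.lstsq(Xs, ys, rcond=None)[0]
    resid = ys - Xs @ b
    return float(resid @ resid)

def wls_f_stat(X_ur, X_r, y, h):
    """F statistic for exclusion restrictions, using identical weights in both models."""
    n, k_plus_1 = X_ur.shape
    q = k_plus_1 - X_r.shape[1]              # number of restrictions
    ssr_ur = weighted_ssr(X_ur, y, h)
    ssr_r = weighted_ssr(X_r, y, h)          # same h as the unrestricted model
    return ((ssr_r - ssr_ur) / q) / (ssr_ur / (n - k_plus_1))

# Simulated data (hypothetical): x2 truly belongs in the model
rng = np.random.default_rng(2)
n = 400
x1 = rng.uniform(1.0, 5.0, size=n)
x2 = rng.normal(size=n)
h = x1                                        # weights treated as given
y = 1.0 + 0.5 * x1 + 1.0 * x2 + rng.normal(size=n) * np.sqrt(h)

X_ur = np.column_stack([np.ones(n), x1, x2])  # unrestricted model
X_r = np.column_stack([np.ones(n), x1])       # restricted model drops x2
F = wls_f_stat(X_ur, X_r, y, h)
```

In an FGLS application, `h` would be the estimated weights obtained from the unrestricted model and then reused unchanged for the restricted model.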

Example 8.7 hints at an issue that sometimes arises in applications of weighted least squares: the
OLS and WLS estimates can be substantially different. This is not such a big problem in the demand
for cigarettes equation because all the coefficients maintain the same signs, and the biggest changes
are on variables that were statistically insignificant when the equation was estimated by OLS. The OLS
and WLS estimates will always differ due to sampling error. The issue is whether their difference is
enough to change important conclusions.

If OLS and WLS produce statistically significant estimates that differ in sign (for example, the OLS
price elasticity is positive and significant, while the WLS price elasticity is negative and significant),
or the difference in magnitudes of the estimates is practically large, we should be suspicious. Typically,
this indicates that one of the other Gauss-Markov assumptions is false, particularly the zero conditional
mean assumption on the error (MLR.4). If $E(y|x) \neq \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$, then OLS and
WLS have different expected values and probability limits. For WLS to be consistent for the $\beta_j$, it is not
enough for u to be uncorrelated with each $x_j$; we need the stronger assumption MLR.4 in the linear model
MLR.1. Therefore, a significant difference between OLS and WLS can indicate a functional form
misspecification in E(y|x). The Hausman test [Hausman (1978)] can be used to formally compare the
OLS and WLS estimates to see if they differ by more than sampling error suggests they should, but
this test is beyond the scope of this text. In many cases, an informal "eyeballing" of the estimates is
sufficient to detect a problem.

GOING FURTHER 8.4
Let $\hat{u}_i$ be the WLS residuals from (8.36), which are not weighted, and let $\widehat{cigs}_i$ be
the fitted values. (These are obtained using the same formulas as OLS; they differ because of different
estimates of the $\beta_j$.) One way to determine whether heteroskedasticity has been eliminated is to use
the $\hat{u}_i^2/\hat{h}_i = (\hat{u}_i/\sqrt{\hat{h}_i})^2$ in a test for heteroskedasticity. [If $h_i = \mathrm{Var}(u_i|x_i)$, then the
transformed residuals should have little evidence of heteroskedasticity.] There are many possibilities,
but one, based on White's test in the transformed equation, is to regress $\hat{u}_i^2/\hat{h}_i$ on
$\widehat{cigs}_i/\sqrt{\hat{h}_i}$ and $\widehat{cigs}_i^2/\hat{h}_i$ (including an intercept). The joint F statistic when we use
SMOKE is 11.15. Does it appear that our correction for heteroskedasticity has actually
eliminated the heteroskedasticity?


8-4c What If the Assumed Heteroskedasticity Function Is Wrong? 


We just noted that if OLS and WLS produce very different estimates, it is likely that the conditional 
mean E(y|x) is misspecified. What are the properties of WLS if the variance function we use is misspecified
in the sense that $\mathrm{Var}(y|x) \neq \sigma^2 h(x)$ for our chosen function h(x)? The most important issue
is whether misspecification of h(x) causes bias or inconsistency in the WLS estimator. Fortunately, the
answer is no, at least under MLR.4. Recall that, if E(u|x) = 0, then any function of x is uncorrelated
with u, and so the weighted error, $u/\sqrt{h(x)}$, is uncorrelated with the weighted regressors, $x_j/\sqrt{h(x)}$,
for any function h(x) that is always positive. This is why, as we just discussed, we can take large differences
between the OLS and WLS estimators as indicative of functional form misspecification. If
we estimate parameters in the function, say $h(x, \hat{\delta})$, then we can no longer claim that WLS is unbiased,
but it will generally be consistent (whether or not the variance function is correctly specified).

If WLS is at least consistent under MLR.1 to MLR.4, what are the consequences of using WLS
with a misspecified variance function? There are two. The first, which is very important, is that the
usual WLS standard errors and test statistics, computed under the assumption that $\mathrm{Var}(y|x) = \sigma^2 h(x)$,
are no longer valid, even in large samples. For example, the WLS estimates and standard errors in column (4)
of Table 8.1 assume that $\mathrm{Var}(nettfa|inc, age, male, e401k) = \mathrm{Var}(nettfa|inc) = \sigma^2 inc$; so we are
assuming not only that the variance depends just on income, but also that it is a linear function of income.
If this assumption is false, the standard errors (and any statistics we obtain using those standard errors) are
not valid. Fortunately, there is an easy fix: just as we can obtain standard errors for the OLS estimates that
are robust to arbitrary heteroskedasticity, we can obtain standard errors for WLS that allow the variance
function to be arbitrarily misspecified. It is easy to see why this works. Write the transformed equation as


$y_i/\sqrt{h_i} = \beta_0(1/\sqrt{h_i}) + \beta_1(x_{i1}/\sqrt{h_i}) + \cdots + \beta_k(x_{ik}/\sqrt{h_i}) + u_i/\sqrt{h_i}.$


Now, if $\mathrm{Var}(u_i|x_i) \neq \sigma^2 h_i$, then the weighted error $u_i/\sqrt{h_i}$ is heteroskedastic. So we can just apply
the usual heteroskedasticity-robust standard errors after estimating this equation by OLS—which, 
remember, is identical to WLS. 

To see how robust inference with WLS works in practice, column (1) of Table 8.2 reproduces the 
last column of Table 8.1, and column (2) contains standard errors robust to $\mathrm{Var}(u_i|x_i) \neq \sigma^2 inc_i$.

The standard errors in column (2) allow the variance function to be misspecified. We see that, for 
the income and age variables, the robust standard errors are somewhat above the usual WLS standard 
errors—certainly by enough to stretch the confidence intervals. On the other hand, the robust standard 
errors for male and e401k are actually smaller than those that assume a correct variance function. We
saw this could happen with the heteroskedasticity-robust standard errors for OLS, too. 

Even if we use flexible forms of variance functions, such as that in (8.30), there is no guarantee 
that we have the correct model. While exponential heteroskedasticity is appealing and reasonably flex- 
ible, it is, after all, just a model. Therefore, it is always a good idea to compute fully robust standard 
errors and test statistics after WLS estimation. 
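
A minimal sketch of robust inference after WLS follows, assuming NumPy is available. It applies the basic White (HC0) sandwich formula, without any degrees-of-freedom adjustment, to the transformed equation; the function name and the simulated data are hypothetical, and the variance function used for weighting is deliberately wrong to mimic a misspecified h(x).

```python
import numpy as np

def wls_with_robust_se(X, y, h):
    """WLS with weights 1/h, returning usual and heteroskedasticity-robust (HC0) SEs."""
    s = np.sqrt(h)
    Xs, ys = X / s[:, None], y / s               # transformed equation
    XtX_inv = np.linalg.inv(Xs.T @ Xs)
    b = XtX_inv @ Xs.T @ ys                      # WLS estimates
    resid = ys - Xs @ b                          # weighted residuals
    n, k = Xs.shape
    sigma2 = resid @ resid / (n - k)
    # Usual WLS standard errors: valid only if Var(u|x) = sigma^2 * h
    se_usual = np.sqrt(np.diag(sigma2 * XtX_inv))
    # Robust standard errors: sandwich formula on the transformed data
    meat = Xs.T @ (Xs * (resid ** 2)[:, None])
    se_robust = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))
    return b, se_usual, se_robust

# Simulated data (hypothetical): the true error s.d. is proportional to x,
# but we weight as if the variance were proportional to x
rng = np.random.default_rng(3)
n = 1000
x = rng.uniform(1.0, 10.0, size=n)
y = 1.0 + 0.5 * x + 0.3 * x * rng.normal(size=n)
X = np.column_stack([np.ones(n), x])

b, se_usual, se_robust = wls_with_robust_se(X, y, h=x)
```

Regression packages typically apply small-sample corrections to the robust variance matrix, so their reported values can differ slightly from this basic form.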


TABLE 8.2 WLS Estimation of the nettfa Equation

Independent       With Nonrobust      With Robust
Variables         Standard Errors     Standard Errors
inc               .740                .740
                  (.064)              (.075)
(age - 25)^2      .0175               .0175
                  (.0019)             (.0026)
male              1.84                1.84
                  (1.56)              (1.31)
e401k             5.19                5.19
                  (1.70)              (1.57)
intercept         -16.70              -16.70
                  (1.96)              (2.24)
Observations      2,017               2,017
R-squared         .1115               .1115

A modern criticism of WLS is that if the variance function is misspecified, it is not guaranteed
to be more efficient than OLS. In fact, that is the case: if Var(y|x) is neither constant nor equal to
$\sigma^2 h(x)$, where h(x) is the proposed model of heteroskedasticity, then we cannot rank OLS and WLS
in terms of variances (or asymptotic variances when the variance parameters must be estimated).
However, this theoretically correct criticism misses an important practical point. Namely, in cases of
strong heteroskedasticity, it is often better to use a wrong form of heteroskedasticity and apply WLS
than to ignore heteroskedasticity altogether in estimation and use OLS. Models such as (8.30) can
well approximate a variety of heteroskedasticity functions and may produce estimators with smaller
(asymptotic) variances. Even in Example 8.6, where the form of heteroskedasticity was assumed to
have the simple form $\mathrm{Var}(nettfa|x) = \sigma^2 inc$, the fully robust standard errors for WLS are well below
the fully robust standard errors for OLS. (Comparing robust standard errors for the two estimators
puts them on equal footing: we assume neither homoskedasticity nor that the variance has the form
$\sigma^2 inc$.) For example, the robust standard error for the WLS estimator of $\beta_{inc}$ is about .075, which is
25% lower than the robust standard error for OLS (about .100). For the coefficient on $(age - 25)^2$,
the robust standard error of WLS is about .0026, almost 40% below the robust standard error for OLS
(about .0043).


8-4d Prediction and Prediction Intervals with Heteroskedasticity 


If we start with the standard linear model under MLR.1 to MLR.4, but allow for heteroskedasticity
of the form $\mathrm{Var}(y|x) = \sigma^2 h(x)$ [see equation (8.21)], the presence of heteroskedasticity affects the
point prediction of y only insofar as it affects estimation of the $\beta_j$. Of course, it is natural to use WLS
on a sample of size n to obtain the $\hat{\beta}_j$. Our prediction of an unobserved outcome, $y^0$, given known
values of the explanatory variables $x^0$, has the same form as in Section 6-4: $\hat{y}^0 = \hat{\beta}_0 + x^0\hat{\beta}$. This
makes sense: once we know E(y|x), we base our prediction on it; the structure of Var(y|x) plays no
direct role.

On the other hand, prediction intervals do depend directly on the nature of Var(y|x). Recall in 
Section 6-4 that we constructed a prediction interval under the classical linear model assumptions. 
Suppose now that all the CLM assumptions hold except that (8.21) replaces the homoskedasticity 
assumption, MLR.5. We know that the WLS estimators are BLUE and, because of normality, have (conditional)
normal distributions. We can obtain $se(\hat{y}^0)$ using the same method in Section 6-4, except that
now we use WLS. [A simple approach is to write $y_i = \theta_0 + \beta_1(x_{i1} - x_1^0) + \cdots + \beta_k(x_{ik} - x_k^0) + u_i$,
where the $x_j^0$ are the values of the explanatory variables for which we want a predicted value of y.
We can estimate this equation by WLS and then obtain $\hat{y}^0 = \hat{\theta}_0$ and $se(\hat{y}^0) = se(\hat{\theta}_0)$.] We also need
to estimate the standard deviation of $u^0$, the unobserved part of $y^0$. But $\mathrm{Var}(u^0|x = x^0) = \sigma^2 h(x^0)$,
and so $se(u^0) = \hat{\sigma}\sqrt{h(x^0)}$, where $\hat{\sigma}$ is the standard error of the regression from the WLS estimation.
Therefore, a 95% prediction interval is

$\hat{y}^0 \pm t_{.025} \cdot se(\hat{e}^0),$ [8.37]

where $se(\hat{e}^0) = \{[se(\hat{y}^0)]^2 + \hat{\sigma}^2 h(x^0)\}^{1/2}$.

This interval is exact only if we do not have to estimate the variance function. If we estimate 
parameters, as in model (8.30), then we cannot obtain an exact interval. In fact, accounting for 
the estimation error in the $\hat{\beta}_j$ and the $\hat{\delta}_j$ (the variance parameters) becomes very difficult. We saw
two examples in Section 6-4 where the estimation error in the parameters was swamped by the
variation in the unobservables, $u^0$. Therefore, we might still use equation (8.37) with $h(x^0)$ simply
replaced by $\hat{h}(x^0)$. In fact, if we are to ignore the parameter estimation error entirely, we can drop
$se(\hat{y}^0)$ from $se(\hat{e}^0)$. [Remember, $se(\hat{y}^0)$ converges to zero at the rate $1/\sqrt{n}$, while $se(\hat{u}^0)$ is roughly
constant.]
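
The interval in (8.37) can be sketched as follows, assuming NumPy is available and that the variance function h is known. The function name, the simulated data, and the use of 1.96 as a large-sample stand-in for $t_{.025}$ are assumptions made for illustration.

```python
import numpy as np

def wls_prediction_interval(X, y, h, x0, h0, crit=1.96):
    """Approximate 95% prediction interval at x0 when Var(u|x) = sigma^2 * h(x).

    X and x0 include the intercept term; h0 is h evaluated at x0.
    crit = 1.96 replaces the exact t critical value (large-sample assumption).
    """
    s = np.sqrt(h)
    Xs, ys = X / s[:, None], y / s               # transformed (weighted) data
    XtX_inv = np.linalg.inv(Xs.T @ Xs)
    b = XtX_inv @ Xs.T @ ys                      # WLS estimates
    resid = ys - Xs @ b
    n, k = Xs.shape
    sigma2 = resid @ resid / (n - k)             # squared standard error of the regression
    y0_hat = x0 @ b                              # point prediction
    var_y0_hat = sigma2 * (x0 @ XtX_inv @ x0)    # [se(y0_hat)]^2
    se_e0 = np.sqrt(var_y0_hat + sigma2 * h0)    # se of the prediction error, as in (8.37)
    return y0_hat, y0_hat - crit * se_e0, y0_hat + crit * se_e0

# Simulated data (hypothetical): Var(u|x) = sigma^2 * x
rng = np.random.default_rng(4)
n = 500
x = rng.uniform(1.0, 10.0, size=n)
y = 2.0 + 0.5 * x + rng.normal(size=n) * np.sqrt(x)
X = np.column_stack([np.ones(n), x])

pred_a, lo_a, hi_a = wls_prediction_interval(X, y, x, np.array([1.0, 5.0]), 5.0)
pred_b, lo_b, hi_b = wls_prediction_interval(X, y, x, np.array([1.0, 9.0]), 9.0)
```

The interval at x = 9 is wider than the one at x = 5 because the term $\hat{\sigma}^2 h(x^0)$ grows with $h(x^0)$, while the contribution of $se(\hat{y}^0)$ is small at this sample size.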

We can also obtain a prediction for y in the model 


$\log(y) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + u,$ [8.38]

where u is heteroskedastic. We assume that u has a conditional normal distribution with a specific form of
heteroskedasticity. We assume the exponential form in equation (8.30), but add the normality assumption: 


$u \mid x_1, x_2, \ldots, x_k \sim \mathrm{Normal}[0, \exp(\delta_0 + \delta_1 x_1 + \cdots + \delta_k x_k)].$ [8.39]


As a notational shorthand, write the variance function as $\exp(\delta_0 + x\delta)$. Then, because log(y) given x
has a normal distribution with mean $\beta_0 + x\beta$ and variance $\exp(\delta_0 + x\delta)$, it follows that

$E(y|x) = \exp(\beta_0 + x\beta + \exp(\delta_0 + x\delta)/2).$ [8.40]


Now, we estimate the £; and 6; using WLS estimation of (8.38). That is, after using OLS to obtain the 
residuals, run the regression in (8.32) to obtain fitted values, 


8 = âo + bx, +. + boxy [8.41] 


and then compute the h, as in (8.33). Using these Î, obtain the WLS estimates, Ê, and also com- 
pute 6? from the weighted squared residuals. Now, compared with the original model for 
Var(u|x), 59 = @ + log(o?), and so Var(u|x) = o exp(a + ôx, +- + ôx). Therefore, the 
estimated variance is 6? exp(é;) = &h,, and the fitted value for y; is 


5, = exp(logy, + 67h,/2). [8.42] 


We can use these fitted values to obtain an R-squared measure, as described in Section 6-4: use the 
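squared correlation coefficient between $y_i$ and $\hat{y}_i$.
For any values of the explanatory variables $\mathbf{x}^0$, we can estimate $E(y|\mathbf{x} = \mathbf{x}^0)$ as


$\hat{E}(y|\mathbf{x} = \mathbf{x}^0) = \exp(\hat{\beta}_0 + \mathbf{x}^0\hat{\boldsymbol{\beta}} + \hat{\sigma}^2 \exp(\hat{\alpha}_0 + \mathbf{x}^0\hat{\boldsymbol{\delta}})/2)$, [8.43]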


where

$\hat{\beta}_j$ = the WLS estimates.
$\hat{\alpha}_0$ = the intercept in (8.41).
$\hat{\delta}_j$ = the slopes from the same regression.
$\hat{\sigma}^2$ is obtained from the WLS estimation.


Obtaining a proper standard error for the prediction in (8.42) is very complicated analytically, but, as 
in Section 6-4, it would be fairly easy to obtain a standard error using a resampling method such as 
the bootstrap described in Appendix 6A. 

Obtaining a prediction interval is more of a challenge when we estimate a model for heteroske- 
dasticity, and a full treatment is complicated. Nevertheless, we saw in Section 6-4 two examples where 
the error variance swamps the estimation error, and we would make only a small mistake by ignoring 
the estimation error in all parameters. Using arguments similar to those in Section 6-4, an approximate 
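95% prediction interval (for large sample sizes) is $\exp[-1.96 \cdot \hat{\sigma}\sqrt{\hat{h}(\mathbf{x}^0)}]\exp(\hat{\beta}_0 + \mathbf{x}^0\hat{\boldsymbol{\beta}})$ to
$\exp[1.96 \cdot \hat{\sigma}\sqrt{\hat{h}(\mathbf{x}^0)}]\exp(\hat{\beta}_0 + \mathbf{x}^0\hat{\boldsymbol{\beta}})$, where $\hat{h}(\mathbf{x}^0)$ is the estimated variance function evaluated at
$\mathbf{x}^0$, $\hat{h}(\mathbf{x}^0) = \exp(\hat{\alpha}_0 + \hat{\delta}_1 x_1^0 + \cdots + \hat{\delta}_k x_k^0)$. As in Section 6-4, we obtain this approximate interval by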
simply exponentiating the endpoints. 
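
The prediction steps in (8.41) through (8.43), together with the approximate interval just described, can be sketched in a few lines of Python. This is only an illustrative sketch using NumPy: the function names `fit_log_model_fgls` and `predict_y` are hypothetical, the degrees-of-freedom divisor n - k - 1 for $\hat{\sigma}^2$ is an assumption, and the interval ignores estimation error in all parameters, as the text suggests.

```python
import numpy as np

def ols(X, y):
    """OLS coefficients for y on X (X must already contain a constant column)."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def fit_log_model_fgls(X, logy):
    """Feasible GLS for log(y) = b0 + x*b + u with Var(u|x) = exp(d0 + x*d)."""
    X = np.asarray(X, dtype=float).reshape(len(logy), -1)
    n = X.shape[0]
    Xc = np.column_stack([np.ones(n), X])
    # Step 1: OLS residuals from the log(y) equation.
    u_hat = logy - Xc @ ols(Xc, logy)
    # Step 2: regress log(u_hat^2) on x; fitted values are g_hat, as in (8.41).
    coef_g = ols(Xc, np.log(u_hat ** 2))  # [alpha0_hat, delta1_hat, ..., deltak_hat]
    h_hat = np.exp(Xc @ coef_g)
    # Step 3: WLS is OLS on the data divided by sqrt(h_hat).
    w = 1.0 / np.sqrt(h_hat)
    beta = ols(Xc * w[:, None], logy * w)
    # sigma^2 from the weighted squared residuals (divisor n - k - 1 is assumed).
    resid_w = (logy - Xc @ beta) * w
    sigma2 = resid_w @ resid_w / (n - Xc.shape[1])
    return beta, coef_g, sigma2

def predict_y(x0, beta, coef_g, sigma2):
    """Estimate E(y|x = x0) as in (8.43) and the approximate 95% interval."""
    x0c = np.concatenate([[1.0], np.atleast_1d(x0)])
    h0 = np.exp(x0c @ coef_g)            # h_hat(x0)
    log_pred = x0c @ beta                # predicted log(y)
    mean = np.exp(log_pred + sigma2 * h0 / 2)
    half = 1.96 * np.sqrt(sigma2 * h0)   # 1.96 * sigma_hat * sqrt(h_hat(x0))
    return mean, (np.exp(log_pred - half), np.exp(log_pred + half))
```

Calling `predict_y(x0, *fit_log_model_fgls(X, logy))` returns the estimated conditional mean of y and the interval endpoints; a bootstrap over the whole procedure would be one way to attach a standard error to the mean.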


8-5 The Linear Probability Model Revisited 


As we saw in Section 7-5, when the dependent variable y is a binary variable, the model must contain 
heteroskedasticity, unless all of the slope parameters are zero. We are now in a position to deal with 
this problem. 

The simplest way to deal with heteroskedasticity in the linear probability model is to continue to 
use OLS estimation, but to also compute robust standard errors in test statistics. This ignores the fact 
that we actually know the form of heteroskedasticity for the LPM. Nevertheless, OLS estimation of 
the LPM is simple and often produces satisfactory results. 
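
As a rough illustration of this first approach, the following Python sketch estimates a linear probability model by OLS and reports both the usual and the heteroskedasticity-robust standard errors. It assumes the standard sandwich form of the robust variance matrix without a degrees-of-freedom correction (some packages apply one), and the function name `ols_with_robust_se` is hypothetical.

```python
import numpy as np

def ols_with_robust_se(X, y):
    """OLS coefficients with usual and heteroskedasticity-robust standard errors."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    Xc = np.column_stack([np.ones(n), np.asarray(X, dtype=float).reshape(n, -1)])
    XtX_inv = np.linalg.inv(Xc.T @ Xc)
    beta = XtX_inv @ Xc.T @ y
    u = y - Xc @ beta
    # Usual standard errors: sigma^2 (X'X)^{-1}.
    sigma2 = u @ u / (n - Xc.shape[1])
    se_usual = np.sqrt(np.diag(sigma2 * XtX_inv))
    # Robust standard errors: (X'X)^{-1} [sum u_i^2 x_i x_i'] (X'X)^{-1}.
    meat = Xc.T @ (Xc * (u ** 2)[:, None])
    se_robust = np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))
    return beta, se_usual, se_robust
```

With a binary `y`, comparing `se_usual` and `se_robust` element by element gives the kind of side-by-side display used in the example that follows.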

EXAMPLE 8.8 Labor Force Participation of Married Women


In the labor force participation example in Section 7-5 [see equation (7.29)], we reported the usual 
OLS standard errors. Now, we compute the heteroskedasticity-robust standard errors as well. These 
are reported in brackets below the usual standard errors: 


$\widehat{inlf}$ = .586  - .0034 nwifeinc + .038 educ + .039 exper
         (.154)   (.0014)           (.007)      (.006)
         [.151]   [.0015]           [.007]      [.006]
       - .00060 $exper^2$ - .016 age - .262 kidslt6 + .0130 kidsge6     [8.44]
         (.00018)          (.002)     (.034)         (.0132)
         [.00019]          [.002]     [.032]         [.0135]


n = 753, $R^2$ = .264.


Several of the robust and OLS standard errors are the same to the reported degree of precision; in all 
cases, the differences are practically very small. Therefore, while heteroskedasticity is a problem in 
theory, it is not in practice, at least not for this example. It often turns out that the usual OLS standard 
errors and test statistics are similar to their heteroskedasticity-robust counterparts. Furthermore, it 
requires a minimal effort to compute both. 


Generally, the OLS estimators are inefficient in the LPM. Recall that the conditional variance of 
y in the LPM is 


$\mathrm{Var}(y|\mathbf{x}) = p(\mathbf{x})[1 - p(\mathbf{x})]$, [8.45]
where
$p(\mathbf{x}) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$ [8.46]


is the response probability (probability of success, y = 1). It seems natural to use weighted least 
squares, but there are a couple of hitches. The probability $p(\mathbf{x})$ clearly depends on the unknown
population parameters, $\beta_j$. Nevertheless, we do have unbiased estimators of these parameters, namely the
OLS estimators. When the OLS estimators are plugged into equation (8.46), we obtain the OLS fitted
values. Thus, for each observation i, $\mathrm{Var}(y_i|\mathbf{x}_i)$ is estimated by


$\hat{h}_i = \hat{y}_i(1 - \hat{y}_i)$, [8.47]


where $\hat{y}_i$ is the OLS fitted value for observation i. Now, we apply feasible GLS, just as in Section 8-4.

Unfortunately, being able to estimate $\hat{h}_i$ for each i does not mean that we can proceed directly
with WLS estimation. The problem is one that we briefly discussed in Section 7-5: the fitted values
$\hat{y}_i$ need not fall in the unit interval. If either $\hat{y}_i < 0$ or $\hat{y}_i > 1$, equation (8.47) shows that $\hat{h}_i$ will be
negative. Because WLS proceeds by multiplying observation i by $1/\sqrt{\hat{h}_i}$, the method will fail if $\hat{h}_i$ is
negative (or zero) for any observation. In other words, all of the weights for WLS must be positive.

In some cases, $0 < \hat{y}_i < 1$ for all i, in which case WLS can be used to estimate the LPM. In
cases with many observations and small probabilities of success or failure, it is very common to find 
some fitted values outside the unit interval. If this happens, as it does in the labor force participation 
example in equation (8.44), it is easiest to abandon WLS and to report the heteroskedasticity-robust 
statistics. An alternative is to adjust those fitted values that are less than zero or greater than unity, and 
then to apply WLS. One suggestion is to set $\hat{y}_i = .01$ if $\hat{y}_i < 0$ and $\hat{y}_i = .99$ if $\hat{y}_i > 1$. Unfortunately,
this requires an arbitrary choice on the part of the researcher; for example, why not use .001 and .999
as the adjusted values? If many fitted values are outside the unit interval, the adjustment to the fitted 
values can affect the results; in this situation, it is probably best to just use OLS. 

Estimating the Linear Probability Model by Weighted Least Squares:
1. Estimate the model by OLS and obtain the fitted values, $\hat{y}$.


2. Determine whether all of the fitted values are inside the unit interval. If so, proceed to step (3). 
If not, some adjustment is needed to bring all fitted values into the unit interval. 


3. Construct the estimated variances in equation (8.47). 


4. Estimate the equation
$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + u$


by WLS, using weights $1/\hat{h}$.
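
The four steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not a definitive implementation: the function name `lpm_wls` and its `adjust` argument are hypothetical, and the optional adjustment simply replaces fitted values at or below zero and at or above one with user-chosen values (for example .01 and .99), which is the arbitrary choice the text warns about.

```python
import numpy as np

def lpm_wls(X, y, adjust=None):
    """Estimate a linear probability model by OLS and then by WLS.

    adjust: optional (low, high) pair used to replace fitted values that are
    <= 0 or >= 1. If None, the function refuses to proceed when any fitted
    value lies outside the open unit interval.
    """
    y = np.asarray(y, dtype=float)
    n = len(y)
    Xc = np.column_stack([np.ones(n), np.asarray(X, dtype=float).reshape(n, -1)])
    # Step 1: OLS and fitted values.
    beta_ols = np.linalg.lstsq(Xc, y, rcond=None)[0]
    y_hat = Xc @ beta_ols
    # Step 2: check that all fitted values are strictly inside (0, 1).
    too_low, too_high = y_hat <= 0, y_hat >= 1
    if too_low.any() or too_high.any():
        if adjust is None:
            raise ValueError("fitted values outside the unit interval; "
                             "use robust OLS inference or pass adjust=(low, high)")
        low, high = adjust
        y_hat = np.where(too_low, low, np.where(too_high, high, y_hat))
    # Step 3: estimated variances, as in (8.47).
    h_hat = y_hat * (1.0 - y_hat)
    # Step 4: WLS with weights 1/h_hat, i.e. OLS on data divided by sqrt(h_hat).
    w = 1.0 / np.sqrt(h_hat)
    beta_wls = np.linalg.lstsq(Xc * w[:, None], y * w, rcond=None)[0]
    return beta_ols, beta_wls, h_hat
```

Raising an error by default, rather than silently adjusting, mirrors the advice that reporting OLS with robust statistics is usually the easiest course when fitted values leave the unit interval.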


EXAMPLE 8.9 Determinants of Personal Computer Ownership 


We use the data in GPA1 to estimate the probability of owning a computer. Let PC denote a binary 
indicator equal to unity if the student owns a computer, and zero otherwise. The variable hsGPA is 
high school GPA, ACT is achievement test score, and parcoll is a binary indicator equal to unity if 
at least one parent attended college. (Separate college indicators for the mother and the father do not 
yield individually significant results, as these are pretty highly correlated.) 

The equation estimated by OLS is 


$\widehat{PC}$ = -.0004 + .065 hsGPA + .0006 ACT + .221 parcoll      [8.48]
      (.4905)  (.137)        (.0155)      (.093)
      [.4888]  [.139]        [.0158]      [.087]

n = 141, $R^2$ = .0415.


Just as with Example 8.8, there are no striking differences between the usual and robust standard 
errors. Nevertheless, we also estimate the model by WLS. Because all of the OLS fitted values are 
inside the unit interval, no adjustments are needed: 

$\widehat{PC}$ = .026 + .033 hsGPA + .0043 ACT + .215 parcoll      [8.49]
      (.477)  (.130)        (.0155)      (.086)

n = 141, $R^2$ = .0464.

There are no important differences in the OLS and WLS estimates. The only significant explanatory
variable is parcoll, and in both cases we estimate that the probability of PC ownership is about .22
higher if at least one parent attended college.


Summary 


We began by reviewing the properties of ordinary least squares in the presence of heteroskedasticity. 
Heteroskedasticity does not cause bias or inconsistency in the OLS estimators, but the usual standard errors 
and test statistics are no longer valid. We showed how to compute heteroskedasticity-robust standard errors 
and t statistics, something that is routinely done by many regression packages. Most regression packages
also compute a heteroskedasticity-robust F-type statistic. 

We discussed two common ways to test for heteroskedasticity: the Breusch-Pagan test and a special 
case of the White test. Both of these statistics involve regressing the squared OLS residuals on either the 
independent variables (BP) or the fitted and squared fitted values (White). A simple F test is asymptotically 
valid; there are also Lagrange multiplier versions of the tests. 
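
To make the two tests concrete, here is a small Python sketch that computes the F and LM statistics for both auxiliary regressions. The function names `r_squared` and `het_tests` are hypothetical, and the formulas used, F = (R²/q)/[(1 - R²)/(n - q - 1)] and LM = n·R² with q regressors in the auxiliary regression, are assumed to be the usual forms; p-values would then come from the F(q, n - q - 1) and chi-square(q) distributions.

```python
import numpy as np

def r_squared(Z, v):
    """R-squared from regressing v on Z plus an intercept."""
    Zc = np.column_stack([np.ones(len(v)), Z])
    resid = v - Zc @ np.linalg.lstsq(Zc, v, rcond=None)[0]
    return 1.0 - resid @ resid / np.sum((v - v.mean()) ** 2)

def het_tests(X, y):
    """Breusch-Pagan and special-case White statistics (F and LM versions)."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    X = np.asarray(X, dtype=float).reshape(n, -1)
    Xc = np.column_stack([np.ones(n), X])
    y_hat = Xc @ np.linalg.lstsq(Xc, y, rcond=None)[0]
    u2 = (y - y_hat) ** 2  # squared OLS residuals
    out = {}
    # BP: regress u^2 on the x's. White special case: on y_hat and y_hat^2.
    for name, Z in (("BP", X), ("White", np.column_stack([y_hat, y_hat ** 2]))):
        q = Z.shape[1]
        r2 = r_squared(Z, u2)
        F = (r2 / q) / ((1.0 - r2) / (n - q - 1))
        out[name] = {"F": F, "LM": n * r2, "df": (q, n - q - 1)}
    return out
```

The White special case always uses two numerator degrees of freedom, whatever the number of regressors in the original model, which is what makes it convenient in models with many explanatory variables.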

OLS is no longer the best linear unbiased estimator in the presence of heteroskedasticity. When the
form of heteroskedasticity is known, GLS estimation can be used. This leads to weighted least squares as a 
means of obtaining the BLUE estimator. The test statistics from the WLS estimation are either exactly valid 
when the error term is normally distributed or asymptotically valid under nonnormality. This assumes, of 
course, that we have the proper model of heteroskedasticity. 

More commonly, we must estimate a model for the heteroskedasticity before applying WLS. The 
resulting feasible GLS estimator is no longer unbiased, but it is consistent and asymptotically efficient. The 
usual statistics from the WLS regression are asymptotically valid. We discussed a method to ensure that 
the estimated variances are strictly positive for all observations, something needed to apply WLS. 

As we discussed in Chapter 7, the linear probability model for a binary dependent variable necessarily 
has a heteroskedastic error term. A simple way to deal with this problem is to compute heteroskedastic- 
ity-robust statistics. Alternatively, if all the fitted values (that is, the estimated probabilities) are strictly 
between zero and one, weighted least squares can be used to obtain asymptotically efficient estimators. 


Key Terms 


Breusch-Pagan Test for Heteroskedasticity (BP Test)
Feasible GLS (FGLS) Estimator
Generalized Least Squares (GLS) Estimators
Heteroskedasticity of Unknown Form
Heteroskedasticity-Robust F Statistic
Heteroskedasticity-Robust LM Statistic
Heteroskedasticity-Robust Standard Error
Heteroskedasticity-Robust t Statistic
Weighted Least Squares (WLS) Estimators
White Test for Heteroskedasticity


Problems 


1 Which of the following are consequences of heteroskedasticity? 
(i) The OLS estimators, $\hat{\beta}_j$, are inconsistent.
(ii) The usual F statistic no longer has an F distribution. 
(iii) The OLS estimators are no longer BLUE. 


2 Consider a linear model to explain monthly beer consumption: 
$beer = \beta_0 + \beta_1 inc + \beta_2 price + \beta_3 educ + \beta_4 female + u$
$E(u|inc, price, educ, female) = 0$
$\mathrm{Var}(u|inc, price, educ, female) = \sigma^2 inc^2$.
Write the transformed equation that has a homoskedastic error term. 
3 True or False? WLS is preferred to OLS when an important variable has been omitted from the model. 
4 Using the data in GPA3, the following equation was estimated for the fall and second semester students:


$\widehat{trmgpa}$ = -2.12 + .900 crsgpa + .193 cumgpa + .0014 tothrs
          (.55)   (.175)        (.064)        (.0012)
          [.55]   [.166]        [.074]        [.0012]
        + .0018 sat - .0039 hsperc + .351 female - .157 season
          (.0002)     (.0018)        (.085)        (.098)
          [.0002]     [.0019]        [.079]        [.080]


n = 269, $R^2$ = .465.


Here, trmgpa is term GPA, crsgpa is a weighted average of overall GPA in courses taken, cumgpa is
GPA prior to the current semester, tothrs is total credit hours prior to the semester, sat is SAT score,
hsperc is graduating percentile in high school class, female is a gender dummy, and season is a dummy
variable equal to unity if the student's sport is in season during the fall. The usual and
heteroskedasticity-robust standard errors are reported in parentheses and brackets, respectively.

(i) Do the variables crsgpa, cumgpa, and tothrs have the expected estimated effects? Which of 
these variables are statistically significant at the 5% level? Does it matter which standard errors 
are used? 

(ii) Why does the hypothesis $H_0: \beta_{crsgpa} = 1$ make sense? Test this hypothesis against the two-sided
alternative at the 5% level, using both standard errors. Describe your conclusions. 

(iii) Test whether there is an in-season effect on term GPA, using both standard errors. Does the 
significance level at which the null can be rejected depend on the standard error used? 


5 The variable smokes is a binary variable equal to one if a person smokes, and zero otherwise. Using the
data in SMOKE, we estimate a linear probability model for smokes:


$\widehat{smokes}$ = .656 - .069 log(cigpric) + .012 log(income) - .029 educ
          (.855)  (.204)              (.026)             (.006)
          [.856]  [.207]              [.026]             [.006]
        + .020 age - .00026 $age^2$ - .101 restaurn - .026 white
          (.006)     (.00006)        (.039)          (.052)
          [.005]     [.00006]        [.038]          [.050]


n = 807, $R^2$ = .062.


The variable white equals one if the respondent is white, and zero otherwise; the other independent
variables are defined in Example 8.7. Both the usual and heteroskedasticity-robust standard errors are reported.

(i) Are there any important differences between the two sets of standard errors? 

(ii) Holding other factors fixed, if education increases by four years, what happens to the estimated 
probability of smoking? 

(iii) At what point does another year of age reduce the probability of smoking?

(iv) Interpret the coefficient on the binary variable restaurn (a dummy variable equal to one if the 
person lives in a state with restaurant smoking restrictions). 

(v) Person number 206 in the data set has the following characteristics: cigpric = 67.44, 
income = 6,500, educ = 16, age = 77, restaurn = 0, white = 0, and smokes = 0. Compute 
the predicted probability of smoking for this person and comment on the result. 


6 There are different ways to combine features of the Breusch-Pagan and White tests for
heteroskedasticity. One possibility not covered in the text is to run the regression


$\hat{u}_i^2$ on $x_{i1}, x_{i2}, \ldots, x_{ik}, \hat{y}_i^2$, $i = 1, \ldots, n$,


where the $\hat{u}_i$ are the OLS residuals and the $\hat{y}_i$ are the OLS fitted values. Then, we would test joint
significance of $x_{i1}, x_{i2}, \ldots, x_{ik}$ and $\hat{y}_i^2$. (Of course, we always include an intercept in this regression.)

(i) What are the df associated with the proposed F test for heteroskedasticity?

(ii) Explain why the R-squared from the regression above will always be at least as large as the 
R-squareds for the BP regression and the special case of the White test. 

(iii) Does part (ii) imply that the new test always delivers a smaller p-value than either the BP or 
special case of the White statistic? Explain. 

(iv) Suppose someone suggests also adding $\hat{y}_i$ to the newly proposed test. What do you think of this idea?

7 Consider a model at the employee level,


$y_{i,e} = \beta_0 + \beta_1 x_{i,e,1} + \beta_2 x_{i,e,2} + \cdots + \beta_k x_{i,e,k} + f_i + v_{i,e}$,


where the unobserved variable $f_i$ is a "firm effect" to each employee at a given firm i. The error
term $v_{i,e}$ is specific to employee e at firm i. The composite error is $u_{i,e} = f_i + v_{i,e}$, such as in
equation (8.28).

(i) Assume that $\mathrm{Var}(f_i) = \sigma_f^2$, $\mathrm{Var}(v_{i,e}) = \sigma_v^2$, and $f_i$ and $v_{i,e}$ are uncorrelated. Show that
$\mathrm{Var}(u_{i,e}) = \sigma_f^2 + \sigma_v^2$; call this $\sigma^2$.

(ii) Now suppose that for $e \neq g$, $v_{i,e}$ and $v_{i,g}$ are uncorrelated. Show that $\mathrm{Cov}(u_{i,e}, u_{i,g}) = \sigma_f^2$.

(iii) Let $\bar{u}_i = m_i^{-1}\sum_{e=1}^{m_i} u_{i,e}$ be the average of the composite errors within a firm. Show that
$\mathrm{Var}(\bar{u}_i) = \sigma_f^2 + \sigma_v^2/m_i$.

(iv) Discuss the relevance of part (iii) for WLS estimation using data averaged at the firm level,
where the weight used for observation i is the usual firm size.


8 The following equations were estimated using the data in ECONMATH. The first equation is for men 
and the second is for women. The third and fourth equations combine men and women. 


$\widehat{score}$ = 20.52 + 13.60 colgpa + 0.670 act
         (3.72)  (0.94)         (0.150)
n = 406, $R^2$ = .4025, SSR = 38,781.38.


$\widehat{score}$ = 13.79 + 11.89 colgpa + 1.03 act
         (4.11)  (1.09)         (0.18)
n = 408, $R^2$ = .3666, SSR = 48,029.82.


$\widehat{score}$ = 15.60 + 3.17 male + 12.82 colgpa + 0.838 act
         (2.80)  (0.73)      (0.72)         (0.116)
n = 814, $R^2$ = .3946, SSR = 87,128.96.


$\widehat{score}$ = 13.79 + 6.73 male + 11.89 colgpa + 1.03 act + 1.72 male·colgpa - 0.364 male·act
         (3.91)  (5.55)      (1.04)         (0.17)     (1.44)             (0.232)
n = 814, $R^2$ = .3968, SSR = 86,811.20.


(i) Compute the usual Chow statistic for testing the null hypothesis that the regression equations
are the same for men and women. Find the p-value of the test. 

(ii) Compute the usual Chow statistic for testing the null hypothesis that the slope coefficients are 
the same for men and women, and report the p-value. 

(iii) Do you have enough information to compute heteroskedasticity-robust versions of the tests in
(i) and (ii)? Explain.


9 Consider the potential outcomes framework, where w is a binary treatment indicator and the
potential outcomes are y(0) and y(1). Assume that w is randomly assigned, so that w is independent of
[y(0), y(1)]. Let $\mu_0 = E[y(0)]$, $\mu_1 = E[y(1)]$, $\sigma_0^2 = \mathrm{Var}[y(0)]$, and $\sigma_1^2 = \mathrm{Var}[y(1)]$.

(i) Define the observed outcome as $y = (1 - w)y(0) + wy(1)$. Letting $\tau = \mu_1 - \mu_0$ be the average
treatment effect, show you can write


$y = \mu_0 + \tau w + (1 - w)v(0) + wv(1)$,
where $v(0) = y(0) - \mu_0$ and $v(1) = y(1) - \mu_1$.

(ii) Let $u = (1 - w)v(0) + wv(1)$ be the error term in
$y = \mu_0 + \tau w + u$.
Show that
$E(u|w) = 0$.


What statistical properties does this finding imply about the OLS estimator of $\tau$ from the simple
regression of $y_i$ on $w_i$ for a random sample of size n? What happens as $n \to \infty$?

(iii) Show that


$\mathrm{Var}(u|w) = E(u^2|w) = (1 - w)\sigma_0^2 + w\sigma_1^2$.


Is there generally heteroskedasticity in the error variance?

(iv) If you think $\sigma_1^2 \neq \sigma_0^2$, and $\hat{\tau}$ is the OLS estimator, how would you obtain a valid standard error
for $\hat{\tau}$?

(v) After obtaining the OLS residuals, $\hat{u}_i$, $i = 1, \ldots, n$, propose a regression that allows consistent
estimation of $\sigma_0^2$ and $\sigma_1^2$. [Hint: You should first square the residuals.]


Computer Exercises 


C1 Consider the following model to explain sleeping behavior: 


$sleep = \beta_0 + \beta_1 totwrk + \beta_2 educ + \beta_3 age + \beta_4 age^2 + \beta_5 yngkid + \beta_6 male + u$.


(i) Write down a model that allows the variance of u to differ between men and women. The
variance should not depend on other factors.

(ii) Use the data in SLEEP75 to estimate the parameters of the model for heteroskedasticity. (You
have to first estimate the sleep equation by OLS to obtain the OLS residuals.) Is the estimated
variance of u higher for men or for women?

(iii) Is the variance of u statistically different for men and for women? 


C2 (i) Use the data in HPRICE1 to obtain the heteroskedasticity-robust standard errors for equation
(8.17). Discuss any important differences with the usual standard errors. 
(ii) Repeat part (i) for equation (8.18). 
(iii) What does this example suggest about heteroskedasticity and the transformation used for the 
dependent variable? 


C3 Apply the full White test for heteroskedasticity [see equation (8.19)] to equation (8.18). Using the chi- 
square form of the statistic, obtain the p-value. What do you conclude? 


C4 Use VOTE1 for this exercise.

(i) Estimate a model with voteA as the dependent variable and prtystrA, democA, log(expendA),
and log(expendB) as independent variables. Obtain the OLS residuals, $\hat{u}_i$, and regress these on
all of the independent variables. Explain why you obtain $R^2 = 0$.

(ii) Now, compute the Breusch-Pagan test for heteroskedasticity. Use the F statistic version and 
report the p-value. 

(iii) Compute the special case of the White test for heteroskedasticity, again using the F statistic 
form. How strong is the evidence for heteroskedasticity now? 


C5 Use the data in PNTSPRD for this exercise. 
(i) The variable sprdcvr is a binary variable equal to one if the Las Vegas point spread for a college
basketball game was covered. The expected value of sprdcvr, say $\mu$, is the probability that the
spread is covered in a randomly selected game. Test $H_0: \mu = .5$ against $H_1: \mu \neq .5$ at the 10%
significance level and discuss your findings. (Hint: This is easily done using a t test by
regressing sprdcvr on an intercept only.)

(ii) How many games in the sample of 553 were played on a neutral court?

(iii) Estimate the linear probability model


$sprdcvr = \beta_0 + \beta_1 favhome + \beta_2 neutral + \beta_3 fav25 + \beta_4 und25 + u$


and report the results in the usual form. (Report the usual OLS standard errors and the
heteroskedasticity-robust standard errors.) Which variable is most significant, both practically and statistically?

(iv) Explain why, under the null hypothesis $H_0: \beta_1 = \beta_2 = \beta_3 = \beta_4 = 0$, there is no
heteroskedasticity in the model.

(v) Use the usual F statistic to test the hypothesis in part (iv). What do you conclude?

(vi) Given the previous analysis, would you say that it is possible to systematically predict whether
the Las Vegas spread will be covered using information available prior to the game?


C6 In Example 7.12, we estimated a linear probability model for whether a young man was arrested
during 1986:


$arr86 = \beta_0 + \beta_1 pcnv + \beta_2 avgsen + \beta_3 tottime + \beta_4 ptime86 + \beta_5 qemp86 + u$.


(i) Using the data in CRIME1, estimate this model by OLS and verify that all fitted values are
strictly between zero and one. What are the smallest and largest fitted values?

(ii) Estimate the equation by weighted least squares, as discussed in Section 8-5.

(iii) Use the WLS estimates to determine whether avgsen and tottime are jointly significant at the
5% level.


C7 Use the data in LOANAPP for this exercise.

(i) Estimate the equation in part (iii) of Computer Exercise C8 in Chapter 7, computing the
heteroskedasticity-robust standard errors. Compare the 95% confidence interval on $\beta_{white}$ with
the nonrobust confidence interval.

(ii) Obtain the fitted values from the regression in part (i). Are any of them less than zero? Are any
of them greater than one? What does this mean about applying weighted least squares?


C8 Use the data set GPA1 for this exercise.

(i) Use OLS to estimate a model relating colGPA to hsGPA, ACT, skipped, and PC. Obtain the
OLS residuals.

(ii) Compute the special case of the White test for heteroskedasticity. In the regression of $\hat{u}_i^2$ on
$\widehat{colGPA}_i$, $\widehat{colGPA}_i^2$, obtain the fitted values, say $\hat{h}_i$.

(iii) Verify that the fitted values from part (ii) are all strictly positive. Then, obtain the weighted
least squares estimates using weights $1/\hat{h}_i$. Compare the weighted least squares estimates for
the effect of skipping lectures and the effect of PC ownership with the corresponding OLS
estimates. What about their statistical significance?

(iv) In the WLS estimation from part (iii), obtain heteroskedasticity-robust standard errors. In other
words, allow for the fact that the variance function estimated in part (ii) might be misspecified.
(See Question 8.4.) Do the standard errors change much from part (iii)?


C9 In Example 8.7, we computed the OLS and a set of WLS estimates in a cigarette demand equation.

(i) Obtain the OLS estimates in equation (8.35).

(ii) Obtain the $\hat{h}_i$ used in the WLS estimation of equation (8.36) and reproduce equation (8.36). From
this equation, obtain the unweighted residuals and fitted values; call these $\hat{u}_i$ and $\hat{y}_i$, respectively.
(For example, in Stata®, the unweighted residuals and fitted values are given by default.)

(iii) Let $\breve{u}_i = \hat{u}_i/\sqrt{\hat{h}_i}$ and $\breve{y}_i = \hat{y}_i/\sqrt{\hat{h}_i}$ be the weighted quantities. Carry out the special case of the
White test for heteroskedasticity by regressing $\breve{u}_i^2$ on $\breve{y}_i$, $\breve{y}_i^2$, being sure to include an intercept, as
always. Do you find heteroskedasticity in the weighted residuals?

(iv) What does the finding from part (iii) imply about the proposed form of heteroskedasticity used
in obtaining (8.36)?

(v) Obtain valid standard errors for the WLS estimates that allow the variance function to be
misspecified.


C10 Use the data set 401KSUBS for this exercise. 


(i) Using OLS, estimate a linear probability model for e401k, using as explanatory variables inc,
$inc^2$, age, $age^2$, and male. Obtain both the usual OLS standard errors and the
heteroskedasticity-robust versions. Are there any important differences?

(ii) In the special case of the White test for heteroskedasticity, where we regress the squared OLS
residuals on a quadratic in the OLS fitted values, $\hat{u}_i^2$ on $\hat{y}_i$, $\hat{y}_i^2$, $i = 1, \ldots, n$, argue that the
probability limit of the coefficient on $\hat{y}_i$ should be one, the probability limit of the coefficient on $\hat{y}_i^2$
should be -1, and the probability limit of the intercept should be zero. {Hint: Remember that
$\mathrm{Var}(y|x_1, \ldots, x_k) = p(\mathbf{x})[1 - p(\mathbf{x})]$, where $p(\mathbf{x}) = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$.}

(iii) For the model estimated from part (i), obtain the White test and see if the coefficient estimates
roughly correspond to the theoretical values described in part (ii).

(iv) After verifying that the fitted values from part (i) are all between zero and one, obtain the
weighted least squares estimates of the linear probability model. Do they differ in important
ways from the OLS estimates?


C11 Use the data in 401KSUBS for this question, restricting the sample to fsize = 1. 


(i) To the model estimated in Table 8.1, add the interaction term, e401k · inc. Estimate the equation
by OLS and obtain the usual and robust standard errors. What do you conclude about the
statistical significance of the interaction term?

(ii) Now estimate the more general model by WLS using the same weights, $1/inc_i$, as in Table 8.1.
Compute the usual and robust standard error for the WLS estimator. Is the interaction term
statistically significant using the robust standard error?

(iii) Discuss the WLS coefficient on e401k in the more general model. Is it of much interest by
itself? Explain.

(iv) Reestimate the model by WLS but use the interaction term e401k · (inc - 30); the average
income in the sample is about 29.44. Now interpret the coefficient on e401k.


C12 Use the data in MEAP00 to answer this question.

(i) Estimate the model
$math4 = \beta_0 + \beta_1 lunch + \beta_2 \log(enroll) + \beta_3 \log(exppp) + u$


by OLS and obtain the usual standard errors and the fully robust standard errors. How do they
generally compare?

(ii) Apply the special case of the White test for heteroskedasticity. What is the value of the F test?
What do you conclude?

(iii) Obtain $\hat{g}_i$ as the fitted values from the regression $\log(\hat{u}_i^2)$ on $\widehat{math4}_i$, $\widehat{math4}_i^2$, where $\widehat{math4}_i$ are
the OLS fitted values and the $\hat{u}_i$ are the OLS residuals. Let $\hat{h}_i = \exp(\hat{g}_i)$. Use the $\hat{h}_i$ to obtain
WLS estimates. Are there big differences with the OLS coefficients?

(iv) Obtain the standard errors for WLS that allow misspecification of the variance function. Do
these differ much from the usual WLS standard errors?

(v) For estimating the effect of spending on math4, does OLS or WLS appear to be more precise?


C13 Use the data in FERTIL2 to answer this question. 


(i) Estimate the model
$children = \beta_0 + \beta_1 age + \beta_2 age^2 + \beta_3 educ + \beta_4 electric + \beta_5 urban + u$


and report the usual and heteroskedasticity-robust standard errors. Are the robust standard errors
always bigger than the nonrobust ones?

(ii) Add the three religious dummy variables and test whether they are jointly significant. What are
the p-values for the nonrobust and robust tests?

(iii) From the regression in part (ii), obtain the fitted values $\hat{y}$ and the residuals, $\hat{u}$. Regress $\hat{u}^2$ on $\hat{y}$, $\hat{y}^2$
and test the joint significance of the two regressors. Conclude that heteroskedasticity is present
in the equation for children.

(iv) Would you say the heteroskedasticity you found in part (iii) is practically important?


C14 Use the data in BEAUTY for this question. 


(i) Using the data pooled for men and women, estimate the equation
$lwage = \beta_0 + \beta_1 belavg + \beta_2 abvavg + \beta_3 female + \beta_4 educ + \beta_5 exper + \beta_6 exper^2 + u$,


and report the results using heteroskedasticity-robust standard errors below coefficients. Are any
of the coefficients surprising in either their signs or magnitudes? Is the coefficient on female
practically large and statistically significant?

(ii) Add interactions of female with all other explanatory variables in the equation from part (i) (five
interactions in all). Compute the usual F test of joint significance of the five interactions and a
heteroskedasticity-robust version. Does using the heteroskedasticity-robust version change the
outcome in any important way?

(iii) In the full model with interactions, determine whether those involving the looks variables,
female · belavg and female · abvavg, are jointly significant. Are their coefficients practically
small?


CHAPTER 9


More on Specification and Data Issues

In Chapter 8, we dealt with one failure of the Gauss-Markov assumptions. While heteroskedasticity
in the errors can be viewed as a problem with a model, it is a relatively minor one. The presence of
heteroskedasticity does not cause bias or inconsistency in the OLS estimators. Also, it is fairly easy
to adjust confidence intervals and t and F statistics to obtain valid inference after OLS estimation, or
even to get more efficient estimators by using weighted least squares.

In this chapter, we return to the much more serious problem of correlation between the error, u, 
and one or more of the explanatory variables. Remember from Chapter 3 that if u is, for whatever 
reason, correlated with the explanatory variable $x_j$, then we say that $x_j$ is an endogenous explanatory
variable. We also provide a more detailed discussion on three reasons why an explanatory variable 
can be endogenous; in some cases, we discuss possible remedies. 

We have already seen in Chapters 3 and 5 that omitting a key variable can cause correlation 
between the error and some of the explanatory variables, which generally leads to bias and incon- 
sistency in all of the OLS estimators. In the special case that the omitted variable is a function of an 
explanatory variable in the model, the model suffers from functional form misspecification. 

We begin in the first section by discussing the consequences of functional form misspecification 
and how to test for it. In Section 9-2, we show how the use of proxy variables can solve, or at least 
mitigate, omitted variables bias. In Section 9-3, we derive and explain the bias in OLS that can arise 
under certain forms of measurement error. Additional data problems are discussed in Section 9-4. 

All of the procedures in this chapter are based on OLS estimation. As we will see, certain
problems that cause correlation between the error and some explanatory variables cannot be solved by using
OLS on a single cross section. We postpone a treatment of alternative estimation methods until Part 3.

9-1 Functional Form Misspecification


A multiple regression model suffers from functional form misspecification when it does not properly
account for the relationship between the dependent and the observed explanatory variables. For
example, if hourly wage is determined by $\log(wage) = \beta_0 + \beta_1 educ + \beta_2 exper + \beta_3 exper^2 + u$, but we
omit the squared experience term, $exper^2$, then we are committing a functional form
misspecification. We already know from Chapter 3 that this generally leads to biased estimators of $\beta_0$, $\beta_1$, and $\beta_2$.
(We do not estimate $\beta_3$ because $exper^2$ is excluded from the model.) Thus, misspecifying how exper
affects log(wage) generally results in a biased estimator of the return to education, $\beta_1$. The amount of
this bias depends on the size of $\beta_3$ and the correlation among educ, exper, and $exper^2$.

Things are worse for estimating the return to experience: even if we could get an unbiased
estimator of $\beta_2$, we would not be able to estimate the return to experience because it equals $\beta_2 + 2\beta_3 exper$
(in decimal form). Just using the biased estimator of $\beta_2$ can be misleading, especially at extreme
values of exper.
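
A small simulation can illustrate the point about the return to experience. The sketch below is purely illustrative: the parameter values, sample size, and variable distributions are hypothetical choices, not estimates from any data set, and educ is generated independently of exper so that the focus stays on the experience coefficients.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
educ = rng.integers(8, 21, size=n).astype(float)   # hypothetical years of schooling
exper = rng.uniform(0, 40, size=n)                 # hypothetical years of experience
b0, b1, b2, b3 = 0.5, 0.08, 0.04, -0.0007          # hypothetical true parameters
logwage = b0 + b1 * educ + b2 * exper + b3 * exper**2 + rng.normal(0, 0.4, size=n)

def ols(X, y):
    """OLS coefficients, with the intercept first."""
    Xc = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(Xc, y, rcond=None)[0]

# Correctly specified model: includes exper^2.
full = ols(np.column_stack([educ, exper, exper**2]), logwage)
# Misspecified model: omits exper^2.
short = ols(np.column_stack([educ, exper]), logwage)

def return_to_exper(exper_level, beta2, beta3=0.0):
    """Approximate return to one more year of experience: beta2 + 2*beta3*exper."""
    return beta2 + 2.0 * beta3 * exper_level

for level in (1, 30):
    print(level,
          return_to_exper(level, full[2], full[3]),   # varies with exper
          return_to_exper(level, short[2]))           # constant, from the short model
```

In this setup the misspecified model reports a single return to experience at every level of exper, while the quadratic model lets the return fall as experience rises, which is why the short model is most misleading at extreme values of exper.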

As another example, suppose the log(wage) equation is 


$\log(wage) = \beta_0 + \beta_1 educ + \beta_2 exper + \beta_3 exper^2 + \beta_4 female + \beta_5 female \cdot educ + u$, [9.1]


where female is a binary variable. If we omit the interaction term, $female \cdot educ$, then we are
misspecifying the functional form. In general, we will not get unbiased estimators of any of the other
parameters, and because the return to education depends on gender, it is not clear what return we would be
estimating by omitting the interaction term.

Omitting functions of independent variables is not the only way that a model can suffer from 
misspecified functional form. For example, if (9.1) is the true model satisfying the first four Gauss- 
Markov assumptions, but we use wage rather than log(wage) as the dependent variable, then we will 
not obtain unbiased or consistent estimators of the partial effects. The tests that follow have some 
ability to detect this kind of functional form problem, but there are better tests that we will mention in 
the subsection on testing against nonnested alternatives. 

Misspecifying the functional form of a model can certainly have serious consequences. 
Nevertheless, in one important respect, the problem is minor: by definition, we have data on all the 
necessary variables for obtaining a functional relationship that fits the data well. This can be con- 
trasted with the problem addressed in the next section, where a key variable is omitted on which we 
cannot collect data. 

We already have a very powerful tool for detecting misspecified functional form: the F test for 
joint exclusion restrictions. It often makes sense to add quadratic terms of any significant variables 
to a model and to perform a joint test of significance. If the additional quadratics are significant, they 
can be added to the model (at the cost of complicating the interpretation of the model). However, 
significant quadratic terms can be symptomatic of other functional form problems, such as using the 
level of a variable when the logarithm is more appropriate, or vice versa. It can be difficult to pinpoint 
the precise reason that a functional form is misspecified. Fortunately, in many cases, using logarithms 
of certain variables and adding quadratics are sufficient for detecting many important nonlinear rela- 
tionships in economics. 
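
The joint test for added quadratics can be sketched in a few lines. The example below is only an illustration on simulated data: the variable names, the sample size, and the coefficient values are assumptions made for the sketch, not values from the text. It compares a restricted model (levels only) with an unrestricted model (levels plus squares) using the usual sum-of-squared-residuals form of the F statistic.

```python
import numpy as np

def ols_ssr(X, y):
    """Sum of squared residuals from regressing y on X (X includes a constant)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

def f_stat_exclusion(X_restricted, X_unrestricted, y):
    """F statistic for excluding the extra columns in X_unrestricted."""
    n, k_ur = X_unrestricted.shape
    q = k_ur - X_restricted.shape[1]      # number of exclusion restrictions
    df = n - k_ur                          # denominator degrees of freedom
    ssr_r = ols_ssr(X_restricted, y)
    ssr_ur = ols_ssr(X_unrestricted, y)
    F = ((ssr_r - ssr_ur) / q) / (ssr_ur / df)
    return F, q, df

# Simulated data (hypothetical): the true model contains a square of x1.
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 0.5 * x1 - 0.3 * x2 + 0.4 * x1**2 + rng.normal(size=n)

const = np.ones(n)
X_r = np.column_stack([const, x1, x2])           # levels only
X_ur = np.column_stack([X_r, x1**2, x2**2])      # levels plus quadratics

F, q, df = f_stat_exclusion(X_r, X_ur, y)
print(f"F = {F:.2f} with df = ({q}, {df})")
```

A large F statistic relative to the critical value of the F distribution with (q, df) degrees of freedom indicates that the quadratics are jointly significant, which is the signal of neglected nonlinearity described above.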


EXAMPLE 9.1 Economic Model of Crime 


Table 9.1 contains OLS estimates of the economic model of crime (see Example 8.3). We first esti- 
mate the model without any quadratic terms; those results are in column (1). 

In column (2), the squares of pcnv, ptime86, and inc86 are added; we chose to include the squares
of these variables because each level term is significant in column (1). The variable qemp86 is a
discrete variable taking on only five values, so we do not include its square in column (2).

TABLE 9.1 Dependent Variable: narr86 

Independent Variables      (1)           (2)
pcnv                      -.133          .553
                          (.040)        (.154)
pcnv²                                   -.730
                                        (.156)
avgsen                    -.011         -.017
                          (.012)        (.012)
tottime                    .012          .012
                          (.009)        (.009)
ptime86                   -.041          .287
                          (.009)        (.004)
ptime86²                                -.0296
                                        (.0039)
qemp86                    -.051         -.014
                          (.014)        (.017)
inc86                     -.0015        -.0034
                          (.0003)       (.0008)
inc86²                                   .000007
                                        (.000003)
black                      .327          .292
                          (.045)        (.045)
hispan                     .194          .164
                          (.040)        (.039)
intercept                  .569          .505
                          (.036)        (.037)
Observations              2,725         2,725
R-squared                  .0723         .1035

Standard errors are in parentheses. A blank entry means the variable is not included in that column.


Each of the squared terms is significant, and together they are jointly very significant (F = 31.37, 
with df = 3 and 2,713; the p-value is essentially zero). Thus, it appears that the initial model over- 
looked some potentially important nonlinearities. 

The presence of the quadratics makes interpreting the model somewhat difficult. For example, pcnv
no longer has a strict deterrent effect: the relationship between narr86 and pcnv is positive up until
pcnv = .365, and then the relationship is negative. We might conclude that there is little or no deterrent
effect at lower values of pcnv; the effect only kicks in at higher prior conviction rates. We would have to
use more sophisticated functional forms than the quadratic to verify this conclusion. It may be that
pcnv is not entirely exogenous. For example, men who have not been convicted in the past (so that
pcnv = 0) are perhaps casual criminals, and so they are less likely to be arrested in 1986. This could
be biasing the estimates. 

GOING FURTHER 9.1
Why do we not include the squares of black and hispan in column (2) of Table 9.1? Would it make sense
to add interactions of black and hispan with some of the other variables reported in the table?

Similarly, the relationship between narr86 and ptime86 is positive up until ptime86 = 4.85
(almost five months in prison), and then the relationship is negative. The vast majority of men in the
sample spent no time in prison in 1986, so again we must be careful in interpreting the results. 

Legal income has a negative effect on narr86 until inc86 = 242.85; because income is measured
in hundreds of dollars, this means an annual income of $24,285. Only 46 of the men in the sample
have incomes above this level. Thus, we can conclude that narr86 and inc86 are negatively related
with a diminishing effect. 
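
The turning points quoted in this example follow from the usual formula for a quadratic: with level coefficient b1 and squared-term coefficient b2, the fitted relationship changes direction at x* = -b1/(2·b2). The sketch below applies that formula to the ptime86 coefficients reported in column (2) of Table 9.1. The helper name `turning_point` is hypothetical, and because the table reports rounded coefficients, a value computed this way can differ slightly from one based on unrounded estimates.

```python
def turning_point(b_level, b_square):
    """Value of x at which b_level*x + b_square*x**2 changes direction."""
    if b_square == 0:
        raise ValueError("no turning point without a quadratic term")
    return -b_level / (2.0 * b_square)

# Rounded column (2) coefficients on ptime86 and its square from Table 9.1.
ptime86_star = turning_point(0.287, -0.0296)
print(f"ptime86 turning point: {ptime86_star:.2f} months")
```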


Example 9.1 is a tricky functional form problem due to the nature of the dependent variable. 
Other models are theoretically better suited for handling dependent variables taking on a small num- 
ber of integer values. We will briefly cover these models in Chapter 17. 


9-1a RESET as a General Test for Functional Form Misspecification 


Some tests have been proposed to detect general functional form misspecification. Ramsey’s (1969) 
regression specification error test (RESET) has proven to be useful in this regard. 
The idea behind RESET is fairly simple. If the original model 


$$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + u \tag{9.2}$$


satisfies MLR.4, then no nonlinear functions of the independent variables should be significant when 
added to equation (9.2). In Example 9.1, we added quadratics in the significant explanatory vari- 
ables. Although this often detects functional form problems, it has the drawback of using up many 
degrees of freedom if there are many explanatory variables in the original model (much as the straight 
form of the White test for heteroskedasticity consumes degrees of freedom). Further, certain kinds of 
neglected nonlinearities will not be picked up by adding quadratic terms. RESET adds polynomials 
in the OLS fitted values to equation (9.2) to detect general kinds of functional form misspecification. 

To implement RESET, we must decide how many functions of the fitted values to include in an 
expanded regression. There is no right answer to this question, but the squared and cubed terms have 
proven to be useful in most applications. 

Let $\hat{y}$ denote the OLS fitted values from estimating (9.2). Consider the expanded equation 


$$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + \delta_1 \hat{y}^2 + \delta_2 \hat{y}^3 + \text{error}. \tag{9.3}$$


This equation seems a little odd, because functions of the fitted values from the initial estimation now 
appear as explanatory variables. In fact, we will not be interested in the estimated parameters from 
(9.3); we only use this equation to test whether (9.2) has missed important nonlinearities. The thing to
remember is that $\hat{y}^2$ and $\hat{y}^3$ are just nonlinear functions of the $x_j$.

The null hypothesis is that (9.2) is correctly specified. Thus, RESET is the F statistic for testing
$H_0\colon \delta_1 = 0, \delta_2 = 0$ in the expanded model (9.3). A significant F statistic suggests some
sort of functional form problem. The distribution of the F statistic is approximately $F_{2,n-k-3}$ in large
samples under the null hypothesis (and the Gauss-Markov assumptions). The df in the expanded equation
(9.3) is $n - k - 1 - 2 = n - k - 3$. An LM version is also available (and the chi-square distribution will
have two df). Further, the test can be made robust to heteroskedasticity using the methods discussed
in Section 8-2. 
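
The steps of RESET can be sketched as follows. This is a minimal illustration rather than a reference implementation: the function name `reset_f_stat` and the simulated data are assumptions made for the example, and the version shown is the nonrobust F form with the squared and cubed fitted values. The p-value would come from the F distribution with 2 and n − k − 3 degrees of freedom.

```python
import numpy as np

def reset_f_stat(X, y):
    """RESET F statistic using squared and cubed OLS fitted values.

    X is n x (k+1) and must already contain a column of ones.
    Returns the F statistic and the denominator df, n - k - 3.
    """
    n, p = X.shape                                   # p = k + 1
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    y_hat = X @ beta
    ssr_r = float(np.sum((y - y_hat) ** 2))          # restricted SSR

    # Expanded regression: original regressors plus y_hat^2 and y_hat^3.
    X_exp = np.column_stack([X, y_hat**2, y_hat**3])
    gamma, *_ = np.linalg.lstsq(X_exp, y, rcond=None)
    ssr_ur = float(np.sum((y - X_exp @ gamma) ** 2)) # unrestricted SSR

    df = n - p - 2                                   # equals n - k - 3
    F = ((ssr_r - ssr_ur) / 2) / (ssr_ur / df)
    return F, df

# Simulated data (hypothetical): a linear model is fit although y depends on x1 squared.
rng = np.random.default_rng(1)
n = 300
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + x1 + 0.5 * x2 + 0.5 * x1**2 + rng.normal(size=n)

X = np.column_stack([np.ones(n), x1, x2])
F, df = reset_f_stat(X, y)
print(f"RESET F = {F:.2f} with df = (2, {df})")
```

Because the fitted values are linear combinations of the regressors, their powers bring in squares and cross products of the original variables while using only two degrees of freedom.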


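EXAMPLE 9.2 Housing Price Equation 

We estimate two models for housing prices. The first one has all variables in level form: 

$$\mathit{price} = \beta_0 + \beta_1 \mathit{lotsize} + \beta_2 \mathit{sqrft} + \beta_3 \mathit{bdrms} + u. \tag{9.4}$$

The second one uses the logarithms of all variables except bdrms: 

$$\log(\mathit{price}) = \beta_0 + \beta_1 \mathit{llotsize} + \beta_2 \mathit{lsqrft} + \beta_3 \mathit{bdrms} + u. \tag{9.5}$$
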
Using n = 88 houses in HPRICE1, the RESET statistic for equation (9.4) turns out to be 4.67; this
is the value of an $F_{2,82}$ random variable (n = 88, k = 3), and the associated p-value is .012. This is
evidence of functional form misspecification in (9.4). 

The RESET statistic in (9.5) is 2.56, with p-value = .084. Thus, we do not reject (9.5) at the 5% 
significance level (although we would at the 10% level). On the basis of RESET, the log-log model in 
(9.5) is preferred. 


In the previous example, we tried two models for explaining housing prices. One was rejected by 
RESET, while the other was not (at least at the 5% level). Often, things are not so simple. A drawback 
with RESET is that it provides no real direction on how to proceed if the model is rejected. Rejecting 
(9.4) by using RESET does not immediately suggest that (9.5) is the next step. Equation (9.5) was 
estimated because constant elasticity models are easy to interpret and can have nice statistical proper- 
ties. In this example, it so happens that it passes the functional form test as well. 

Some have argued that RESET is a very general test for model misspecification, including unob- 
served omitted variables and heteroskedasticity. Unfortunately, such use of RESET is largely mis- 
guided. It can be shown that RESET has no power for detecting omitted variables whenever they have 
expectations that are linear in the included independent variables in the model [see Wooldridge (2001, 
Section 2-1) for a precise statement]. Further, if the functional form is properly specified, RESET has 
no power for detecting heteroskedasticity. The bottom line is that RESET is a functional form test, 
and nothing more. 


9-1b Tests against Nonnested Alternatives 


Obtaining tests for other kinds of functional form misspecification—for example, trying to decide 
whether an independent variable should appear in level or logarithmic form—takes us outside the 
realm of classical hypothesis testing. It is possible to test the model 


$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u \tag{9.6}$$


against the model 


$$y = \beta_0 + \beta_1 \log(x_1) + \beta_2 \log(x_2) + u, \tag{9.7}$$


and vice versa. However, these are nonnested models (see Chapter 6), and so we cannot simply use a 
standard F test. Two different approaches have been suggested. The first is to construct a comprehen- 
sive model that contains each model as a special case and then to test the restrictions that led to each 
of the models. In the current example, the comprehensive model is 


$$y = \gamma_0 + \gamma_1 x_1 + \gamma_2 x_2 + \gamma_3 \log(x_1) + \gamma_4 \log(x_2) + u. \tag{9.8}$$


We can first test $H_0\colon \gamma_3 = 0, \gamma_4 = 0$ as a test of (9.6). We can also test
$H_0\colon \gamma_1 = 0, \gamma_2 = 0$ as a test of (9.7). This approach was suggested by Mizon and Richard (1986). 

Another approach has been suggested by Davidson and MacKinnon (1981). They point out that if
model (9.6) holds with $E(u|x_1, x_2) = 0$, the fitted values from the other model, (9.7), should be
insignificant when added to equation (9.6). Therefore, to test whether (9.6) is the correct model, we first
estimate model (9.7) by OLS to obtain the fitted values; call these $\check{y}$. The Davidson-MacKinnon test
is obtained from the t statistic on $\check{y}$ in the auxiliary equation 


$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \theta_1 \check{y} + \text{error}.$$


Because the $\check{y}$ are just nonlinear functions of $x_1$ and $x_2$, they should be insignificant if (9.6)
is the correct conditional mean model. Therefore, a significant t statistic (against a two-sided
alternative) is a rejection of (9.6). 

Similarly, if $\hat{y}$ denotes the fitted values from estimating (9.6), the test of (9.7) is the t statistic on
$\hat{y}$ in the model 


$$y = \beta_0 + \beta_1 \log(x_1) + \beta_2 \log(x_2) + \theta_1 \hat{y} + \text{error};$$


a significant t statistic is evidence against (9.7). The same two tests can be used for testing any two
nonnested models with the same dependent variable. 
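
The two Davidson-MacKinnon regressions can be sketched with a small helper. Everything specific here is an assumption for illustration: the function names `ols` and `dm_t_stat`, the simulated data, and the choice to generate y from the logarithmic model so that the levels model is the misspecified one. The t statistics use the usual nonrobust OLS standard errors.

```python
import numpy as np

def ols(X, y):
    """OLS coefficients and usual (nonrobust) standard errors."""
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = float(resid @ resid) / (n - p)
    se = np.sqrt(sigma2 * np.diag(XtX_inv))
    return beta, se

def dm_t_stat(X_null, X_alt, y):
    """t statistic on the alternative model's fitted values added to the null model."""
    beta_alt, _ = ols(X_alt, y)
    fitted_alt = X_alt @ beta_alt                 # fitted values from the competing model
    X_aux = np.column_stack([X_null, fitted_alt]) # auxiliary regression
    beta, se = ols(X_aux, y)
    return beta[-1] / se[-1]

# Simulated data (hypothetical): the log model is the true one.
rng = np.random.default_rng(7)
n = 400
x1 = rng.uniform(1.0, 10.0, size=n)
x2 = rng.uniform(1.0, 10.0, size=n)
y = 1.0 + 2.0 * np.log(x1) + 3.0 * np.log(x2) + 0.5 * rng.normal(size=n)

const = np.ones(n)
X_levels = np.column_stack([const, x1, x2])                   # model (9.6)
X_logs = np.column_stack([const, np.log(x1), np.log(x2)])     # model (9.7)

t_levels = dm_t_stat(X_levels, X_logs, y)   # test of (9.6)
t_logs = dm_t_stat(X_logs, X_levels, y)     # test of (9.7)
print(f"t for testing the levels model: {t_levels:.2f}")
print(f"t for testing the log model:    {t_logs:.2f}")
```

In this design the test of the levels model should reject clearly, while the outcome of the test of the log model depends on the particular sample drawn.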

There are a few problems with nonnested testing. First, a clear winner need not emerge. Both 
models could be rejected or neither model could be rejected. In the latter case, we can use the 
adjusted R-squared to choose between them. If both models are rejected, more work needs to be done. 
However, it is important to know the practical consequences from using one form or the other: if the 
effects of key independent variables on y are not very different, then it does not really matter which 
model is used. 

A second problem is that rejecting (9.6) using, say, the Davidson-MacKinnon test does not 
mean that (9.7) is the correct model. Model (9.6) can be rejected for a variety of functional form 
misspecifications. 

An even more difficult problem is obtaining nonnested tests when the competing models have 
different dependent variables. The leading case is y versus log(y). We saw in Chapter 6 that just 
obtaining goodness-of-fit measures that can be compared requires some care. Tests have been pro- 
posed to solve this problem, but they are beyond the scope of this text. [See Wooldridge (1994a) for a 
test that has a simple interpretation and is easy to implement.] 


9-2 Using Proxy Variables for Unobserved Explanatory Variables 


A more difficult problem arises when a model excludes a key variable, usually because of data
unavailability. Consider a wage equation that explicitly recognizes that ability (abil) affects log(wage): 


$$\log(\mathit{wage}) = \beta_0 + \beta_1 \mathit{educ} + \beta_2 \mathit{exper} + \beta_3 \mathit{abil} + u. \tag{9.9}$$


This model shows explicitly that we want to hold ability fixed when measuring the return to educ and
exper. If, say, educ is correlated with abil, then putting abil in the error term causes the OLS estimator
of $\beta_1$ (and $\beta_2$) to be biased, a theme that has appeared repeatedly. 

Our primary interest in equation (9.9) is in the slope parameters $\beta_1$ and $\beta_2$. We do not really care
whether we get an unbiased or consistent estimator of the intercept $\beta_0$; as we will see shortly, this is
not usually possible. Also, we can never hope to estimate $\beta_3$ because abil is not observed; in fact, we
would not know how to interpret $\beta_3$ anyway, because ability is at best a vague concept. 

How can we solve, or at least mitigate, the omitted variables bias in an equation like (9.9)? One 
possibility is to obtain a proxy variable for the omitted variable. Loosely speaking, a proxy variable 
is something that is related to the unobserved variable that we would like to control for in our analy- 
sis. In the wage equation, one possibility is to use the intelligence quotient, or IQ, as a proxy for abil- 
ity. This does not require IQ to be the same thing as ability; what we need is for IQ to be correlated 
with ability, something we clarify in the following discussion. 

All of the key ideas can be illustrated in a model with three independent variables, two of which 
are observed: 


$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3^* + u. \tag{9.10}$$

We assume that data are available on y, $x_1$, and $x_2$; in the wage example, these are log(wage), educ,
and exper, respectively. The explanatory variable $x_3^*$ is unobserved, but we have a proxy variable for
$x_3^*$. Call the proxy variable $x_3$. 

What do we require of $x_3$? At a minimum, it should have some relationship to $x_3^*$. This is captured
by the simple regression equation 


$$x_3^* = \delta_0 + \delta_3 x_3 + v_3, \tag{9.11}$$


where $v_3$ is an error due to the fact that $x_3^*$ and $x_3$ are not exactly related. The parameter $\delta_3$
measures the relationship between $x_3^*$ and $x_3$; typically, we think of $x_3^*$ and $x_3$ as being positively
related, so that $\delta_3 > 0$. If $\delta_3 = 0$, then $x_3$ is not a suitable proxy for $x_3^*$. The intercept
$\delta_0$ in (9.11), which can be positive or negative, simply allows $x_3^*$ and $x_3$ to be measured on
different scales. (For example, unobserved ability is certainly not required to have the same average
value as IQ in the U.S. population.) 

How can we use $x_3$ to get unbiased (or at least consistent) estimators of $\beta_1$ and $\beta_2$? The proposal
is to pretend that $x_3$ and $x_3^*$ are the same, so that we run the regression of 


$$y \text{ on } x_1, x_2, x_3. \tag{9.12}$$


We call this the plug-in solution to the omitted variables problem because $x_3$ is just plugged in for
$x_3^*$ before we run OLS. If $x_3$ is truly related to $x_3^*$, this seems like a sensible thing. However,
because $x_3$ and $x_3^*$ are not the same, we should determine when this procedure does in fact give
consistent estimators of $\beta_1$ and $\beta_2$. 

The assumptions needed for the plug-in solution to provide consistent estimators of $\beta_1$ and $\beta_2$ can
be broken down into assumptions about u and $v_3$: 

(1) The error u is uncorrelated with $x_1$, $x_2$, and $x_3^*$, which is just the standard assumption in model
(9.10). In addition, u is uncorrelated with $x_3$. This latter assumption just means that $x_3$ is irrelevant in
the population model, once $x_1$, $x_2$, and $x_3^*$ have been included. This is essentially true by definition, as
$x_3$ is a proxy variable for $x_3^*$: it is $x_3^*$ that directly affects y, not $x_3$. Thus, the assumption that u is
uncorrelated with $x_1$, $x_2$, $x_3^*$, and $x_3$ is not very controversial. (Another way to state this assumption
is that the expected value of u, given all these variables, is zero.) 

(2) The error $v_3$ is uncorrelated with $x_1$, $x_2$, and $x_3$. Assuming that $v_3$ is uncorrelated with $x_1$
and $x_2$ requires $x_3$ to be a “good” proxy for $x_3^*$. This is easiest to see by writing the analog of these
assumptions in terms of conditional expectations: 


$$E(x_3^*|x_1, x_2, x_3) = E(x_3^*|x_3) = \delta_0 + \delta_3 x_3. \tag{9.13}$$


The first equality, which is the most important one, says that, once $x_3$ is controlled for, the expected
value of $x_3^*$ does not depend on $x_1$ or $x_2$. Alternatively, $x_3^*$ has zero correlation with $x_1$ and $x_2$
once $x_3$ is partialled out. 

In the wage equation (9.9), where IQ is the proxy for ability, condition (9.13) becomes 


$$E(\mathit{abil}|\mathit{educ}, \mathit{exper}, \mathit{IQ}) = E(\mathit{abil}|\mathit{IQ}) = \delta_0 + \delta_3 \mathit{IQ}.$$


Thus, the average level of ability only changes with IQ, not with educ and exper. Is this reasonable?
Maybe it is not exactly true, but it may be close to being true. It is certainly worth including IQ in the
wage equation to see what happens to the estimated return to education. 

We can easily see why the previous assumptions are enough for the plug-in solution to work. If 
we plug equation (9.11) into equation (9.10) and do simple algebra, we get 


$$y = (\beta_0 + \beta_3\delta_0) + \beta_1 x_1 + \beta_2 x_2 + \beta_3\delta_3 x_3 + u + \beta_3 v_3.$$

Call the composite error in this equation $e = u + \beta_3 v_3$; it depends on the error in the model of
interest, (9.10), and the error in the proxy variable equation, $v_3$. Because u and $v_3$ both have zero mean
and each is uncorrelated with $x_1$, $x_2$, and $x_3$, e also has zero mean and is uncorrelated with $x_1$, $x_2$,
and $x_3$. Write this equation as 


$$y = \alpha_0 + \beta_1 x_1 + \beta_2 x_2 + \alpha_3 x_3 + e,$$


where $\alpha_0 = (\beta_0 + \beta_3\delta_0)$ is the new intercept and $\alpha_3 = \beta_3\delta_3$ is the slope parameter on
the proxy variable $x_3$. As we alluded to earlier, when we run the regression in (9.12), we will not get
unbiased estimators of $\beta_0$ and $\beta_3$; instead, we will get unbiased (or at least consistent) estimators of
$\alpha_0$, $\beta_1$, $\beta_2$, and $\alpha_3$. The important thing is that we get good estimates of the parameters
$\beta_1$ and $\beta_2$. 

In most cases, the estimate of $\alpha_3$ is actually more interesting than an estimate of $\beta_3$ anyway. For
example, in the wage equation, $\alpha_3$ measures the return to wage given one more point on IQ score. 
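
A small simulation can illustrate the plug-in argument. The sketch below is an assumption-laden illustration, not an analysis of real data: all parameter values, the sample size, and the variable names are hypothetical. The unobserved variable is generated from the proxy plus an error that is unrelated to the other regressors, so the proxy conditions hold by construction. It then compares OLS that simply omits the unobserved variable with OLS that plugs in the proxy.

```python
import numpy as np

def ols_coefs(X, y):
    """OLS coefficients from regressing y on the columns of X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

rng = np.random.default_rng(42)
n = 20_000

# Hypothetical population parameters.
b0, b1, b2, b3 = 1.0, 0.5, 0.3, 1.0   # model of interest
d0, d3 = 1.0, 2.0                     # proxy equation

x3 = rng.normal(size=n)               # observed proxy
x1 = 0.5 * x3 + rng.normal(size=n)    # x1 is correlated with the proxy
x2 = rng.normal(size=n)
v3 = rng.normal(size=n)               # unrelated to x1, x2, and x3
x3_star = d0 + d3 * x3 + v3           # unobserved variable
u = rng.normal(size=n)
y = b0 + b1 * x1 + b2 * x2 + b3 * x3_star + u

const = np.ones(n)
coef_omit = ols_coefs(np.column_stack([const, x1, x2]), y)      # ignores x3_star
coef_plug = ols_coefs(np.column_stack([const, x1, x2, x3]), y)  # plug-in solution

print("coefficient on x1, omitting the variable:", round(coef_omit[1], 3))
print("coefficient on x1, plug-in with proxy:   ", round(coef_plug[1], 3))
print("intercept and proxy slope (plug-in):     ", round(coef_plug[0], 3), round(coef_plug[3], 3))
```

In this design the plug-in regression should recover a slope on x1 close to b1, while its intercept and proxy slope estimate b0 + b3·d0 and b3·d3 rather than b0 and b3.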


EXAMPLE 9.3 IQ as a Proxy for Ability 


The file WAGE2, from Blackburn and Neumark (1992), contains information on monthly earnings,
education, several demographic variables, and IQ scores for 935 men in 1980. As a method to account
for omitted ability bias, we add IQ to a standard log wage equation. The results are shown in Table 9.2. 

Our primary interest is in what happens to the estimated return to education. Column (1) contains
the estimates without using IQ as a proxy variable. The estimated return to education is 6.5%. If we
think omitted ability is positively correlated with educ, then we assume that this estimate is too high.
(More precisely, the average estimate across all random samples would be too high.) When IQ is
added to the equation, the return to education falls to 5.4%, which corresponds with our prior beliefs
about omitted ability bias. 


TABLE 9.2 Dependent Variable: log(wage) 

Independent Variables      (1)         (2)         (3)
educ                       .065        .054        .018
                          (.006)      (.007)      (.041)
exper                      .014        .014        .014
                          (.003)      (.003)      (.003)
tenure                     .012        .011        .011
                          (.002)      (.002)      (.002)
married                    .199        .200        .201
                          (.039)      (.039)      (.039)
south                     -.091       -.080       -.080
                          (.026)      (.026)      (.026)
urban                      .184        .182        .184
                          (.027)      (.027)      (.027)
black                     -.188       -.143       -.143
                          (.038)      (.039)      (.040)
IQ                                     .0036      -.0009
                                      (.0010)     (.0052)
educ·IQ                                            .00034
                                                  (.00038)
intercept                 5.395       5.176       5.648
                          (.113)      (.128)      (.546)
Observations               935         935         935
R-squared                  .253        .263        .263

Standard errors are in parentheses. A blank entry means the variable is not included in that column.

The effect of IQ on socioeconomic outcomes has been documented in the controversial book The 
Bell Curve, by Herrnstein and Murray (1994). Column (2) shows that IQ does have a statistically sig- 
nificant, positive effect on earnings, after controlling for several other factors. Everything else being 
equal, an increase of 10 IQ points is predicted to raise monthly earnings by 3.6%. The standard devia- 
tion of IQ in the U.S. population is 15, so a one standard deviation increase in IQ is associated with 
higher earnings of 5.4%. This is identical to the predicted increase in wage due to another year of 
education. It is clear from column (2) that education still has an important role in increasing earnings, 
even though the effect is not as large as originally estimated. 
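
The percentage effects quoted above come from multiplying a coefficient in a log(wage) equation by 100 and by the change in the explanatory variable. The short sketch below reproduces that arithmetic with the column (2) coefficients from Table 9.2. The helper names are hypothetical, and the exact-percentage function is included only to show the standard alternative to the approximation.

```python
import math

coef_iq = 0.0036     # column (2) coefficient on IQ
coef_educ = 0.054    # column (2) coefficient on educ

def approx_pct_effect(coef, delta_x):
    """Approximate percentage change in wage: 100 * coef * delta_x."""
    return 100.0 * coef * delta_x

def exact_pct_effect(coef, delta_x):
    """Exact percentage change implied by a log model: 100 * (exp(coef * delta_x) - 1)."""
    return 100.0 * (math.exp(coef * delta_x) - 1.0)

ten_iq_points = approx_pct_effect(coef_iq, 10)    # ten more IQ points
one_sd_iq = approx_pct_effect(coef_iq, 15)        # one standard deviation of IQ
one_year_educ = approx_pct_effect(coef_educ, 1)   # one more year of education

print(round(ten_iq_points, 1), round(one_sd_iq, 1), round(one_year_educ, 1))
print(round(exact_pct_effect(coef_educ, 1), 2))   # exact version, slightly larger
```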

Some other interesting observations emerge from columns (1) and (2). Adding IQ to the equation
only increases the R-squared from .253 to .263. Most of the variation in log(wage) is not explained by
the factors in column (2). Also, adding IQ to the equation does not eliminate the estimated earnings
difference between black and white men: a black man with the same IQ, education, experience, and so on,
as a white man is predicted to earn about 14.3% less, and the difference is very statistically significant. 

Column (3) in Table 9.2 includes the interaction term educ·IQ. This allows for the possibility that
educ and abil interact in determining log(wage). We might think that the return to education is higher for
people with more ability, but this turns out not to be the case: the interaction term is not significant, and
its addition makes educ and IQ individually insignificant while complicating the model. Therefore, the
estimates in column (2) are preferred. 

There is no reason to stop at a single proxy variable for ability in this example. The data set
WAGE2 also contains a score for each man on the Knowledge of the World of Work (KWW) test. This
provides a different measure of ability, which can be used in place of IQ or along with IQ, to estimate
the return to education (see Computer Exercise C2). 

GOING FURTHER 9.2
What do you make of the small and statistically insignificant coefficient on educ in column (3) of
Table 9.2? (Hint: When educ·IQ is in the equation, what is the interpretation of the coefficient on educ?) 


It is easy to see how using a proxy variable can still lead to bias if the proxy variable does not
satisfy the preceding assumptions. Suppose that, instead of (9.11), the unobserved variable, $x_3^*$, is
related to all of the observed variables by 


$$x_3^* = \delta_0 + \delta_1 x_1 + \delta_2 x_2 + \delta_3 x_3 + v_3, \tag{9.14}$$


where $v_3$ has a zero mean and is uncorrelated with $x_1$, $x_2$, and $x_3$. Equation (9.11) assumes that
$\delta_1$ and $\delta_2$ are both zero. By plugging equation (9.14) into (9.10), we get 


$$y = (\beta_0 + \beta_3\delta_0) + (\beta_1 + \beta_3\delta_1)x_1 + (\beta_2 + \beta_3\delta_2)x_2 + \beta_3\delta_3 x_3 + u + \beta_3 v_3, \tag{9.15}$$


from which it follows that $\text{plim}(\hat{\beta}_1) = \beta_1 + \beta_3\delta_1$ and
$\text{plim}(\hat{\beta}_2) = \beta_2 + \beta_3\delta_2$. [This follows because the error in (9.15),
$u + \beta_3 v_3$, has zero mean and is uncorrelated with $x_1$, $x_2$, and $x_3$.] In the previous
example where $x_1 = \mathit{educ}$ and $x_3^* = \mathit{abil}$, $\beta_3 > 0$, so there is a positive bias
(inconsistency) if abil has a positive partial correlation with educ ($\delta_1 > 0$). Thus, we could still be
getting an upward bias in the return to education by using IQ as a proxy for abil if IQ is not a good proxy.
But we can reasonably hope that this bias is smaller than if we ignored the problem of omitted ability
entirely. 
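
The probability limit above can be checked numerically. In the sketch below, which uses purely hypothetical parameter values and simulated data, the unobserved variable depends on x1 even after controlling for the proxy (d1 is nonzero), so the proxy is imperfect. With these assumed values the plug-in slope on x1 should settle near b1 + b3·d1 = 0.9, while omitting the variable altogether leads to a larger inconsistency.

```python
import numpy as np

def ols_coefs(X, y):
    """OLS coefficients from regressing y on the columns of X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

rng = np.random.default_rng(123)
n = 20_000

# Hypothetical population parameters.
b0, b1, b2, b3 = 1.0, 0.5, 0.3, 1.0
d0, d1, d3 = 1.0, 0.4, 2.0            # d1 != 0: the proxy is imperfect

x3 = rng.normal(size=n)               # observed proxy
x1 = 0.5 * x3 + rng.normal(size=n)
x2 = rng.normal(size=n)
v3 = rng.normal(size=n)
x3_star = d0 + d1 * x1 + d3 * x3 + v3 # unobserved variable, with the x2 coefficient set to 0
u = rng.normal(size=n)
y = b0 + b1 * x1 + b2 * x2 + b3 * x3_star + u

const = np.ones(n)
coef_omit = ols_coefs(np.column_stack([const, x1, x2]), y)
coef_plug = ols_coefs(np.column_stack([const, x1, x2, x3]), y)

print("true b1:", b1)
print("plug-in slope on x1:", round(coef_plug[1], 3), "(plim is b1 + b3*d1 =", b1 + b3 * d1, ")")
print("slope on x1 when the variable is omitted:", round(coef_omit[1], 3))
```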

A complaint that is sometimes aired about including variables such as IQ in a regression that
includes educ is that it exacerbates the problem of multicollinearity, likely leading to a less precise
estimate of $\beta_{educ}$. But this complaint misses two important points. First, the inclusion of IQ reduces
the error variance because the part of ability explained by IQ has been removed from the error. Typically,
this will be reflected in a smaller standard error of the regression (although it need not get smaller
because of its degrees-of-freedom adjustment). Second, and most importantly, the added multicollinearity
is a necessary evil if we want to get an estimator of $\beta_{educ}$ with less bias: the reason educ and
IQ are correlated is that educ and abil are thought to be correlated, and IQ is a proxy for abil. If we
could observe abil we would include it in the regression, and of course, there would be unavoidable
multicollinearity caused by correlation between educ and abil. 

Proxy variables can come in the form of binary information as well. In Example 7.9 [see equation 
(7.15)], we discussed Krueger’s (1993) estimates of the return to using a computer on the job. Krueger 
also included a binary variable indicating whether the worker uses a computer at home (as well as 
an interaction term between computer usage at work and at home). His primary reason for including 
computer usage at home in the equation was to proxy for unobserved “technical ability” that could 
affect wage directly and be related to computer usage at work. 


9-2a Using Lagged Dependent Variables as Proxy Variables 


In some applications, like the earlier wage example, we have at least a vague idea about which unob- 
served factor we would like to control for. This facilitates choosing proxy variables. In other applica- 
tions, we suspect that one or more of the independent variables is correlated with an omitted variable, 
but we have no idea how to obtain a proxy for that omitted variable. In such cases, we can include, as 
a control, the value of the dependent variable from an earlier time period. This is especially useful for 
policy analysis. 

Using a lagged dependent variable in a cross-sectional equation increases the data require- 
ments, but it also provides a simple way to account for historical factors that cause current differences 
in the dependent variable that are difficult to account for in other ways. For example, some cities have 
had high crime rates in the past. Many of the same unobserved factors contribute to both high current 
and past crime rates. Likewise, some universities are traditionally better in academics than other uni- 
versities. Inertial effects are also captured by putting in lags of y. 

Consider a simple equation to explain city crime rates: 


$$\mathit{crime} = \beta_0 + \beta_1 \mathit{unem} + \beta_2 \mathit{expend} + \beta_3 \mathit{crime}_{-1} + u, \tag{9.16}$$


where crime is a measure of per capita crime, unem is the city unemployment rate, expend is per capita
spending on law enforcement, and $\mathit{crime}_{-1}$ indicates the crime rate measured in some earlier year
(this could be the past year or several years ago). We are interested in the effects of unem on crime, as
well as of law enforcement expenditures on crime. 

What is the purpose of including $\mathit{crime}_{-1}$ in the equation? Certainly, we expect that $\beta_3 > 0$
because crime has inertia. But the main reason for putting this in the equation is that cities with
high historical crime rates may spend more on crime prevention. Thus, factors unobserved by us (the
econometricians) that affect crime are likely to be correlated with expend (and unem). If we use a
pure cross-sectional analysis, we are unlikely to get an unbiased estimator of the causal effect of law
enforcement expenditures on crime. But, by including $\mathit{crime}_{-1}$ in the equation, we can at least do
the following experiment: if two cities have the same previous crime rate and current unemployment rate,
then $\beta_2$ measures the effect of another dollar of law enforcement on crime. 


EXAMPLE 9.4 City Crime Rates 


We estimate a constant elasticity version of the crime model in equation (9.16) (unem, because it is 
a percentage, is left in level form). The data in CRIME2 are from 46 cities for the year 1987. The 
crime rate is also available for 1982, and we use that as an additional independent variable in trying to 
control for city unobservables that affect crime and may be correlated with current law enforcement 
expenditures. Table 9.3 contains the results. 

TABLE 9.3 Dependent Variable: log(crmrte87) 

Independent Variables      (1)         (2)
unem                      -.029        .009
                          (.032)      (.020)
log(lawexpc87)             .203       -.140
                          (.173)      (.109)
log(crmrte82)                         1.194
                                      (.132)
intercept                  3.34        .076
                          (1.25)      (.821)
Observations                46          46
R-squared                  .057        .680

Standard errors are in parentheses. A blank entry means the variable is not included in that column.


Without the lagged crime rate in the equation, the effects of the unemployment rate and expenditures
on law enforcement are counterintuitive; neither is statistically significant, although the t statistic
on log(lawexpc87) is 1.17. One possibility is that increased law enforcement expenditures improve
reporting conventions, and so more crimes are reported. But it is also likely that cities with high
recent crime rates spend more on law enforcement. 

Adding the log of the crime rate from five years earlier has a large effect on the expenditures
coefficient. The elasticity of the crime rate with respect to expenditures becomes -.14, with t = -1.28.
This is not strongly significant, but it suggests that a more sophisticated model with more cities in the
sample could produce significant results. 

Not surprisingly, the current crime rate is strongly related to the past crime rate. The estimate
indicates that if the crime rate in 1982 was 1% higher, then the crime rate in 1987 is predicted to be
about 1.19% higher. We cannot reject the hypothesis that the elasticity of current crime with respect to
past crime is unity [t = (1.194 - 1)/.132 = 1.47]. Adding the past crime rate increases the explanatory
power of the regression markedly, but this is no surprise. The primary reason for including the
lagged crime rate is to obtain a better estimate of the ceteris paribus effect of log(lawexpc87) on
log(crmrte87). 
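
The t statistics quoted in this example can be reproduced directly from the coefficients and standard errors in Table 9.3. The sketch below uses a hypothetical helper, `t_stat`, that allows a null value other than zero, which is what the test of unit elasticity requires.

```python
def t_stat(estimate, std_error, hypothesized=0.0):
    """t statistic for H0: parameter equals the hypothesized value."""
    return (estimate - hypothesized) / std_error

# Column (1): expenditures coefficient without the lagged crime rate.
t_expend_col1 = t_stat(0.203, 0.173)

# Column (2): expenditures coefficient with the lagged crime rate included.
t_expend_col2 = t_stat(-0.140, 0.109)

# Column (2): test that the elasticity with respect to past crime equals one.
t_unit_elasticity = t_stat(1.194, 0.132, hypothesized=1.0)

print(round(t_expend_col1, 2), round(t_expend_col2, 2), round(t_unit_elasticity, 2))
```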


The practice of putting in a lagged y as a general way of controlling for unobserved variables is 
hardly perfect. But it can aid in getting a better estimate of the effects of policy variables on various 
outcomes. When the data are available, additional lags also can be included. 

Adding lagged values of y is not the only way to use two years of data to control for omitted 
factors. When we discuss panel data methods in Chapters 13 and 14, we will cover other ways to use 
repeated data on the same cross-sectional units at different points in time. 


9-2b A Different Slant on Multiple Regression 


The discussion of proxy variables in this section suggests an alternative way of interpreting a multiple 
regression analysis when we do not necessarily observe all relevant explanatory variables. Until now, 
we have specified the population model of interest with an additive error, as in equation (9.9). Our dis- 
cussion of that example hinged upon whether we have a suitable proxy variable (IQ score in this case, 
other test scores more generally) for the unobserved explanatory variable, which we called “ability.” 
A less structured, more general approach to multiple regression is to forego specifying models 
with unobservables. Rather, we begin with the premise that we have access to a set of observable 
explanatory variables (which includes the variable of primary interest, such as years of schooling,
and controls, such as observable test scores). We then model the mean of y conditional on the observed
explanatory variables. For example, in the wage example with lwage denoting log(wage), we
can estimate E(lwage|educ, exper, tenure, south, urban, black, IQ), which is exactly what is reported in
Table 9.2. The difference now is that we set our goals more modestly. Namely, rather than introduce
the nebulous concept of “ability” in equation (9.9), we state from the outset that we will estimate the
ceteris paribus effect of education holding IQ (and the other observed factors) fixed. There is no need
to discuss whether IQ is a suitable proxy for ability. Consequently, while we may not be answering
the question underlying equation (9.9), we are answering a question of interest: if two people have the
same IQ levels (and same values of experience, tenure, and so on), yet they differ in education levels
by a year, what is the expected difference in their log wages?

As another example, if we include as an explanatory variable the poverty rate in a school-level 
regression to assess the effects of spending on standardized test scores, we should recognize that the 
poverty rate only crudely captures the relevant differences in children and parents across schools. But 
often it is all we have, and it is better to control for the poverty rate than to do nothing because we 
cannot find suitable proxies for student “ability,” parental “involvement,” and so on. Almost certainly 
controlling for the poverty rate gets us closer to the ceteris paribus effects of spending than if we leave 
the poverty rate out of the analysis. 

In some applications of regression analysis, we are interested simply in predicting the outcome, y, 
given a set of explanatory variables, $(x_1, \ldots, x_k)$. In such cases, it makes little sense to think in terms
of “bias” in estimated coefficients due to omitted variables. Instead, we should focus on obtaining 
a model that predicts as well as possible, and make sure we do not include as regressors variables 
that cannot be observed at the time of prediction. For example, an admissions officer for a college or 
university might be interested in predicting success in college, as measured by grade point average, 
in terms of variables that can be measured at application time. Those variables would include high 
school performance (maybe just grade point average, but perhaps performance in specific kinds of 
courses), standardized test scores, participation in various activities (such as debate or math club), and 
even family background variables. We would not include a variable measuring college class attend- 
ance because we do not observe attendance in college at application time. Nor would we wring our 
hands about potential “biases” caused by omitting an attendance variable: we have no interest in, say, 
measuring the effect of high school GPA holding attendance in college fixed. Likewise, we would not 
worry about biases in coefficients because we cannot observe factors such as motivation. Naturally, 
for predictive purposes it would probably help substantially if we had a measure of motivation, but in 
its absence we fit the best model we can with observed explanatory variables. 


9-2c Potential Outcomes and Proxy Variables 


The notion of proxy variables can be related to the potential outcomes framework that we introduced 
in Sections 2-7, 3-7, and 4-7, covered in more generality in Section 7-6. Let y(0) and y(1) denote 
the potential outcomes and w the binary treatment indicator. When we include explanatory variables 
$\mathbf{x} = (x_1, x_2, \ldots, x_k)$ in a regression that includes w as an explanatory variable, one way to think about
what we are doing is we are using x as a set of proxy variables for the unobserved factors that affect 
the potential outcomes, y(0) and y(1), and might also be related to the participation decision (w = 1 
or w = 0). Write 


$y(0) = \mu_0 + v(0)$
$y(1) = \mu_1 + v(1)$


where $\mu_0$ and $\mu_1$ are the two counterfactual means and $\tau_{ate} = \mu_1 - \mu_0$ is the average treatment effect.
The problem of selection into participation means that w can be related to the unobservables v(0) and
v(1). The ignorability or unconfoundedness assumption discussed in Sections 3-7 and 7-6 is that,
conditional on x, w is independent of [v(0), v(1)]. This is essentially the assumption that the elements
of x act as suitable proxies for the unobservables. Assuming linear functional forms as in Section 7-6,


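$E[v(0)|w,\mathbf{x}] = E[v(0)|\mathbf{x}] = (\mathbf{x} - \boldsymbol{\eta})\boldsymbol{\beta}_0$ and
$E[v(1)|w,\mathbf{x}] = E[v(1)|\mathbf{x}] = (\mathbf{x} - \boldsymbol{\eta})\boldsymbol{\beta}_1$,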


where $\boldsymbol{\eta} = E(\mathbf{x})$. The first equalities in both equations, which we recognize as implications of the
conditional independence or unconfoundedness assumption, are effectively the same as the proxy variable
condition: conditional on x, w is unrelated to unobserved factors that affect [y(0), y(1)]. From
Section 7-6 we know that unconfoundedness plus the linear functional form leads to a regression with
interaction terms,


$y_i$ on $w_i$, $\mathbf{x}_i$, $w_i \cdot (\mathbf{x}_i - \bar{\mathbf{x}})$, $i = 1, \ldots, n$,


where $\bar{\mathbf{x}}$ is the vector of sample averages and the regression is across the entire sample. The coefficient
on $w_i$ is $\hat{\tau}_{ate}$, the estimate of the average treatment effect. See Section 7-6 for further discussion.
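
The interacted regression can be illustrated with a small simulation. This is only a sketch under assumptions made here for illustration: the data-generating process, the coefficient values, and the helper name `ate_by_interacted_regression` are hypothetical and do not come from the text. Participation is made to depend on the observed x alone, so unconfoundedness holds by construction, and the coefficient on w is compared with a simple difference in means.

```python
import numpy as np


def ate_by_interacted_regression(y, w, x):
    """OLS of y on 1, w, x, and w*(x - xbar); returns the coefficient on w."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    x_dm = x - x.mean(axis=0)  # demean each control with its sample average
    X = np.column_stack([np.ones(len(y)), w, x, w[:, None] * x_dm])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[1]


rng = np.random.default_rng(0)
n = 50_000
x = rng.normal(size=n)

# Assumed selection rule: participation depends on x plus independent noise,
# so w is independent of [v(0), v(1)] conditional on x.
w = (rng.normal(size=n) + 0.8 * x > 0).astype(float)

# Assumed potential outcomes; y(1) - y(0) = 2 + x + noise and E(x) = 0,
# so the true average treatment effect is 2.
y0 = 1.0 + 0.5 * x + rng.normal(size=n)
y1 = 3.0 + 1.5 * x + rng.normal(size=n)
y = np.where(w == 1, y1, y0)

tau_hat = ate_by_interacted_regression(y, w, x)
naive = y[w == 1].mean() - y[w == 0].mean()  # ignores selection on x

print("interacted regression:", round(tau_hat, 3))
print("difference in means:  ", round(naive, 3))
```

In this design the coefficient on w should land near the true effect of 2, while the difference in means is pushed upward because participants have systematically higher x.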


9-3 Models with Random Slopes 


In our treatment of regression so far, we have assumed that the slope coefficients are the same across 
individuals in the population, or that, if the slopes differ, they differ by measurable characteristics, in 
which case we are led to regression models containing interaction terms. For example, as we saw in 
Section 7-4, we can allow the return to education to differ by men and women by interacting educa- 
tion with a gender dummy in a log wage equation. 

Here we are interested in a related but different question: What if the partial effect of a variable 
depends on unobserved factors that vary by population unit? If we have only one explanatory vari- 
able, x, we can write a general model (for a random draw, i, from the population, for emphasis) as 


$y_i = a_i + b_i x_i$, [9.17]


where $a_i$ is the intercept for unit $i$ and $b_i$ is the slope. In the simple regression model from Chapter 2
we assumed $b_i = \beta$ and labeled $a_i$ as the error, $u_i$. The model in (9.17) is sometimes called a random
coefficient model or random slope model because the unobserved slope coefficient, $b_i$, is viewed as
a random draw from the population along with the observed data, $(x_i, y_i)$, and the unobserved intercept,
$a_i$. As an example, if $y_i = \log(wage_i)$ and $x_i = educ_i$, then (9.17) allows the return to education,
$b_i$, to vary by person. If, say, $b_i$ contains unmeasured ability (just as $a_i$ would), the partial effect of
another year of schooling can depend on ability.

With a random sample of size n, we (implicitly) draw n values of $b_i$ along with n values of $a_i$
(and the observed data on x and y). Naturally, we cannot estimate a slope (or, for that matter, an
intercept) for each i. But we can hope to estimate the average slope (and average intercept), where
the average is across the population. Therefore, define $\alpha = E(a_i)$ and $\beta = E(b_i)$. Then $\beta$ is the average
of the partial effect of x on y, and so we call $\beta$ the average partial effect (APE), or the average
marginal effect (AME). In the context of a log wage equation, $\beta$ is the average return to a year of
schooling in the population.

If we write $a_i = \alpha + c_i$ and $b_i = \beta + d_i$, then $d_i$ is the individual-specific deviation from the
APE. By construction, $E(c_i) = 0$ and $E(d_i) = 0$. Substituting into (9.17) gives


$y_i = \alpha + \beta x_i + c_i + d_i x_i \equiv \alpha + \beta x_i + u_i$, [9.18]


where $u_i = c_i + d_i x_i$. (To make the notation easier to follow, we now use $\alpha$, the mean value of $a_i$, as
the intercept, and $\beta$, the mean of $b_i$, as the slope.) In other words, we can write the random coefficient
model as a constant coefficient model but where the error term contains an interaction between an unobservable,
$d_i$, and the observed explanatory variable, $x_i$.

When would a simple regression of $y_i$ on $x_i$ provide an unbiased estimate of $\beta$ (and $\alpha$)? We can
apply the result for unbiasedness from Chapter 2. If $E(u_i|x_i) = 0$, then OLS is generally unbiased.
When $u_i = c_i + d_i x_i$, sufficient is $E(c_i|x_i) = E(c_i) = 0$ and $E(d_i|x_i) = E(d_i) = 0$. We can write these
in terms of the unit-specific intercept and slope as


$E(a_i|x_i) = E(a_i)$ and $E(b_i|x_i) = E(b_i)$; [9.19]


that is, $a_i$ and $b_i$ are both mean independent of $x_i$. This is a useful finding: if we allow for unit-specific
slopes, OLS consistently estimates the population average of those slopes when they are mean independent
of the explanatory variable. (See Problem 6 for a weaker set of conditions that imply consistency
of OLS.)
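
A short simulation can make this finding concrete. The sketch below is illustrative only: the sample size, the distributions, and the parameter values (a mean intercept of 1.0 and a mean slope of 0.08) are assumptions chosen here, not numbers from the text. Each unit gets its own intercept and slope, drawn independently of x, and a single OLS regression is run on the pooled data.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
alpha, beta = 1.0, 0.08  # assumed population means of a_i and b_i (beta is the APE)

x = rng.normal(loc=12.0, scale=2.0, size=n)  # e.g., years of schooling
c = rng.normal(scale=0.3, size=n)            # a_i = alpha + c_i
d = rng.normal(scale=0.05, size=n)           # b_i = beta + d_i, drawn independently of x

# Random coefficient model: every unit has its own intercept and slope.
y = (alpha + c) + (beta + d) * x

# Pooled OLS of y on a constant and x.
X = np.column_stack([np.ones(n), x])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
alpha_hat, beta_hat = coef

print("OLS intercept:", round(alpha_hat, 4), " (mean intercept:", alpha, ")")
print("OLS slope:    ", round(beta_hat, 4), " (average partial effect:", beta, ")")
```

Because the slopes are independent of x by construction, the OLS slope should be close to the average of the individual slopes even though no single unit need have exactly that slope.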

The error term in (9.18) almost certainly contains heteroskedasticity. In fact, if $\mathrm{Var}(c_i|x_i) = \sigma_c^2$,
$\mathrm{Var}(d_i|x_i) = \sigma_d^2$, and $\mathrm{Cov}(c_i, d_i|x_i) = 0$, then


$\mathrm{Var}(u_i|x_i) = \sigma_c^2 + \sigma_d^2 x_i^2$, [9.20]


and so there must be heteroskedasticity in $u_i$ unless $\sigma_d^2 = 0$, which means $b_i = \beta$ for all $i$. We know
how to account for heteroskedasticity of this kind. We can use OLS and compute heteroskedas- 
ticity-robust standard errors and test statistics, or we can estimate the variance function in (9.20) 
and apply weighted least squares. Of course, the latter strategy imposes homoskedasticity on the 
random intercept and slope, and so we would want to make a WLS analysis fully robust to viola- 
tions of (9.20). 
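
For readers who want the intermediate step behind (9.20), the following is a brief sketch. It conditions on $x_i$ throughout, so that $x_i$ can be treated as a constant, and it uses only $u_i = c_i + d_i x_i$ together with the three conditional moment assumptions stated just before (9.20).

```latex
\begin{aligned}
\mathrm{Var}(u_i \mid x_i)
  &= \mathrm{Var}(c_i + d_i x_i \mid x_i) \\
  &= \mathrm{Var}(c_i \mid x_i) + x_i^2\,\mathrm{Var}(d_i \mid x_i)
     + 2 x_i\,\mathrm{Cov}(c_i, d_i \mid x_i) \\
  &= \sigma_c^2 + \sigma_d^2 x_i^2 + 2 x_i \cdot 0 \\
  &= \sigma_c^2 + \sigma_d^2 x_i^2 .
\end{aligned}
```

If the conditional covariance between $c_i$ and $d_i$ were not zero, the cross term would add a piece that is linear in $x_i$, which is one reason the form of (9.20) should be viewed as special.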

Because of equation (9.20), some authors like to view heteroskedasticity in regression models 
generally as arising from random slope coefficients. But we should remember that the form of (9.20) 
is special, and it does not allow for heteroskedasticity in $a_i$ or $b_i$. We cannot convincingly distinguish
between a random slope model, where the intercept and slope are independent of $x_i$, and a constant
slope model with heteroskedasticity in $a_i$.

The treatment for multiple regression is similar. Generally, write 


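$y_i = a_i + b_{i1} x_{i1} + b_{i2} x_{i2} + \cdots + b_{ik} x_{ik}$. [9.21]

Then, by writing $a_i = \alpha + c_i$ and $b_{ij} = \beta_j + d_{ij}$, we have


$y_i = \alpha + \beta_1 x_{i1} + \cdots + \beta_k x_{ik} + u_i$, [9.22]


where $u_i = c_i + d_{i1} x_{i1} + \cdots + d_{ik} x_{ik}$. If we maintain the mean independence assumptions
$E(a_i|\mathbf{x}_i) = E(a_i)$ and $E(b_{ij}|\mathbf{x}_i) = E(b_{ij})$, $j = 1, \ldots, k$, then $E(y_i|\mathbf{x}_i) = \alpha + \beta_1 x_{i1} + \cdots + \beta_k x_{ik}$ and
so OLS using a random sample produces unbiased estimators of $\alpha$ and the $\beta_j$. As in the simple regression
case, $\mathrm{Var}(u_i|\mathbf{x}_i)$ is almost certainly heteroskedastic.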

We can allow the $b_{ij}$ to depend on observable explanatory variables as well as unobservables.
For example, suppose with k = 2 the effect of $x_{i2}$ depends on $x_{i1}$, and we write
$b_{i2} = \beta_2 + \delta_1 (x_{i1} - \mu_1) + d_{i2}$, where $\mu_1 = E(x_{i1})$. If we assume $E(d_{i2}|\mathbf{x}_i) = 0$ (and similarly for $c_i$
and $d_{i1}$), then $E(y_i|x_{i1}, x_{i2}) = \alpha + \beta_1 x_{i1} + \beta_2 x_{i2} + \delta_1 (x_{i1} - \mu_1) x_{i2}$, which means we have an interaction
between $x_{i1}$ and $x_{i2}$. Because we have subtracted the mean $\mu_1$ from $x_{i1}$, $\beta_2$ is the APE of $x_{i2}$.

The bottom line of this section is that allowing for random slopes is fairly straightforward if the 
slopes are independent, or at least mean independent, of the explanatory variables. In addition, we can 
easily model the slopes as functions of the exogenous variables, which leads to models with squares 
and interactions. Of course, in Chapter 6 we discussed how such models can be useful without ever 
introducing the notion of a random slope. The random slopes specification provides a separate justi- 
fication for such models. Estimation becomes considerably more difficult if the random intercept as 
well as some slopes are correlated with some of the regressors. We cover the problem of endogenous 
explanatory variables in Chapter 15.


9-4 Properties of OLS under Measurement Error


Sometimes, in economic applications, we cannot collect data on the variable that truly affects eco- 
nomic behavior. A good example is the marginal income tax rate facing a family that is trying to 
choose how much to contribute to charity in a given year. The marginal rate may be hard to obtain or 
summarize as a single number for all income levels. Instead, we might compute the average tax rate 
based on total income and tax payments. 

When we use an imprecise measure of an economic variable in a regression model, then our model 
contains measurement error. In this section, we derive the consequences of measurement error for ordi- 
nary least squares estimation. OLS will be consistent under certain assumptions, but there are others 
under which it is inconsistent. In some of these cases, we can derive the size of the asymptotic bias. 

As we will see, the measurement error problem has a similar statistical structure to the omitted
variable–proxy variable problem discussed in the previous section, but they are conceptually different.
In the proxy variable case, we are looking for a variable that is somehow associated with
the unobserved variable. In the measurement error case, the variable that we do not observe has a 
well-defined, quantitative meaning (such as a marginal tax rate or annual income), but our recorded 
measures of it may contain error. For example, reported annual income is a measure of actual annual 
income, whereas IQ score is a proxy for ability. 

Another important difference between the proxy variable and measurement error problems is 
that, in the latter case, often the mismeasured independent variable is the one of primary interest. In 
the proxy variable case, the partial effect of the omitted variable is rarely of central interest: we are 
usually concerned with the effects of the other independent variables. 

Before we consider details, we should remember that measurement error is an issue only when 
the variables for which the econometrician can collect data differ from the variables that influence 
decisions by individuals, families, firms, and so on. 


9-4a Measurement Error in the Dependent Variable 


We begin with the case where only the dependent variable is measured with error. Let y* denote the 
variable (in the population, as always) that we would like to explain. For example, y* could be annual
family savings. The regression model has the usual form 


$y^* = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + u$, [9.23]


and we assume it satisfies the Gauss-Markov assumptions. We let y represent the observable measure
of y*. In the savings case, y is reported annual savings. Unfortunately, families are not perfect in their
reporting of annual family savings; it is easy to leave out categories or to overestimate the amount 
contributed to a fund. Generally, we can expect y and y* to differ, at least for some subset of families 
in the population. 

The measurement error (in the population) is defined as the difference between the observed 
value and the actual value: 


$e_0 = y - y^*$. [9.24]


For a random draw i from the population, we can write $e_{i0} = y_i - y_i^*$, but the important thing is how
the measurement error in the population is related to other factors. To obtain an estimable model, we
write $y^* = y - e_0$, plug this into equation (9.23), and rearrange:


$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + u + e_0$. [9.25]


The error term in equation (9.25) is $u + e_0$. Because $y, x_1, x_2, \ldots, x_k$ are observed, we can estimate this
model by OLS. In effect, we just ignore the fact that y is an imperfect measure of y* and proceed as usual.

When does OLS with y in place of y* produce consistent estimators of the $\beta_j$? Because the original
model (9.23) satisfies the Gauss-Markov assumptions, u has zero mean and is uncorrelated with
each $x_j$. It is only natural to assume that the measurement error has zero mean; if it does not, then
we simply get a biased estimator of the intercept, $\beta_0$, which is rarely a cause for concern. Of much
more importance is our assumption about the relationship between the measurement error, $e_0$, and the
explanatory variables, $x_j$. The usual assumption is that the measurement error in y is statistically independent
of each explanatory variable. If this is true, then the OLS estimators from (9.25) are unbiased
and consistent. Further, the usual OLS inference procedures (t, F, and LM statistics) are valid.

If $e_0$ and u are uncorrelated, as is usually assumed, then $\mathrm{Var}(u + e_0) = \sigma_u^2 + \sigma_0^2 > \sigma_u^2$. This
means that measurement error in the dependent variable results in a larger error variance than when 
no error occurs; this, of course, results in larger variances of the OLS estimators. This is to be 
expected and there is nothing we can do about it (except collect better data). The bottom line is that, if 
the measurement error is uncorrelated with the independent variables, then OLS estimation has good 
properties. 
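
Both conclusions, no bias but larger sampling variance, can be checked in a small Monte Carlo experiment. The sketch below is illustrative only; the sample size, the number of replications, the parameter values, and the helper name `ols_slope` are assumptions made here. The measurement error $e_0$ is generated independently of x and u, and the slope estimates obtained with the true y* are compared with those obtained with the error-ridden y.

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 500, 2000
beta0, beta1 = 1.0, 0.5  # assumed population parameters


def ols_slope(x, y):
    """Slope from a simple regression of y on x (intercept included)."""
    xd = x - x.mean()
    return (xd @ (y - y.mean())) / (xd @ xd)


slopes_true, slopes_noisy = [], []
for _ in range(reps):
    x = rng.normal(size=n)
    u = rng.normal(size=n)
    e0 = rng.normal(scale=2.0, size=n)  # measurement error, independent of x and u
    y_star = beta0 + beta1 * x + u      # variable we would like to explain
    y = y_star + e0                     # what is actually recorded
    slopes_true.append(ols_slope(x, y_star))
    slopes_noisy.append(ols_slope(x, y))

slopes_true = np.array(slopes_true)
slopes_noisy = np.array(slopes_noisy)

print("mean slope using y*:", round(slopes_true.mean(), 3))
print("mean slope using y: ", round(slopes_noisy.mean(), 3))
print("sd of slope using y*:", round(slopes_true.std(), 3))
print("sd of slope using y: ", round(slopes_noisy.std(), 3))
```

With these assumed variances the error variance rises from 1 to 5, so the standard deviation of the slope estimates should increase by a factor of roughly the square root of 5, while both sets of estimates remain centered on the same value.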


EXAMPLE 9.5 Savings Function with Measurement Error


Consider a savings function

$sav^* = \beta_0 + \beta_1 inc + \beta_2 size + \beta_3 educ + \beta_4 age + u$,


but where actual savings (sav*) may deviate from reported savings (sav). The question is whether 
the size of the measurement error in sav is systematically related to the other variables. It might be 
reasonable to assume that the measurement error is not correlated with inc, size, educ, and age. On the 
other hand, we might think that families with higher incomes, or more education, report their savings 
more accurately. We can never know whether the measurement error is correlated with inc or educ, 
unless we can collect data on sav*; then, the measurement error can be computed for each observation
as $e_{i0} = sav_i - sav_i^*$.


When the dependent variable is in logarithmic form, so that log(y*) is the dependent variable, it 
is natural for the measurement error equation to be of the form 


$\log(y) = \log(y^*) + e_0$. [9.26]

This follows from a multiplicative measurement error for y: $y = y^* a_0$, where $a_0 > 0$ and
$e_0 = \log(a_0)$.


EXAMPLE 9.6 Measurement Error in Scrap Rates


In Section 7-6, we discussed an example in which we wanted to determine whether job training grants 
reduce the scrap rate in manufacturing firms. We certainly might think the scrap rate reported by firms 
is measured with error. (In fact, most firms in the sample do not even report a scrap rate.) In a simple 
regression framework, this is captured by 


$\log(scrap^*) = \beta_0 + \beta_1 grant + u$,


where scrap* is the true scrap rate and grant is the dummy variable indicating whether a firm received
a grant. The measurement error equation is


$\log(scrap) = \log(scrap^*) + e_0$.

Is the measurement error, $e_0$, independent of whether the firm receives a grant? A cynical person
might think that a firm receiving a grant is more likely to underreport its scrap rate in order to make
the grant look effective. If this happens, then, in the estimable equation,


$\log(scrap) = \beta_0 + \beta_1 grant + u + e_0$,


the error $u + e_0$ is negatively correlated with grant. This would produce a downward bias in the OLS estimate of $\beta_1$, which
would tend to make the training program look more effective than it actually was. (Remember, a more
negative $\beta_1$ means the program was more effective, because increased worker productivity is associated
with a lower scrap rate.)


The bottom line of this subsection is that measurement error in the dependent variable can cause 
biases in OLS if it is systematically related to one or more of the explanatory variables. If the meas- 
urement error is just a random reporting error that is independent of the explanatory variables, as is 
often assumed, then OLS is perfectly appropriate. 


9-4b Measurement Error in an Explanatory Variable 


Traditionally, measurement error in an explanatory variable has been considered a much more impor- 
tant problem than measurement error in the dependent variable. In this subsection, we will see why 
this is the case. 

We begin with the simple regression model 


$y = \beta_0 + \beta_1 x_1^* + u$, [9.27]


and we assume that this satisfies at least the first four Gauss-Markov assumptions. This means that
estimation of (9.27) by OLS would produce unbiased and consistent estimators of $\beta_0$ and $\beta_1$. The
problem is that $x_1^*$ is not observed. Instead, we have a measure of $x_1^*$; call it $x_1$. For example, $x_1^*$ could
be actual income and $x_1$ could be reported income.

The measurement error in the population is simply 


$e_1 = x_1 - x_1^*$, [9.28]


and this can be positive, negative, or zero. We assume that the average measurement error in the
population is zero: $E(e_1) = 0$. This is natural, and, in any case, it does not affect the important conclusions
that follow. A maintained assumption in what follows is that u is uncorrelated with $x_1^*$ and $x_1$.
In conditional expectation terms, we can write this as $E(y|x_1^*, x_1) = E(y|x_1^*)$, which just says that $x_1$
does not affect y after $x_1^*$ has been controlled for. We used the same assumption in the proxy variable
case, and it is not controversial; it holds almost by definition.

We want to know the properties of OLS if we simply replace $x_1^*$ with $x_1$ and run the regression
of y on $x_1$. They depend crucially on the assumptions we make about the measurement error. Two
assumptions have been the focus in econometrics literature, and they both represent polar extremes.
The first assumption is that $e_1$ is uncorrelated with the observed measure, $x_1$:


$\mathrm{Cov}(x_1, e_1) = 0$. [9.29]


From the relationship in (9.28), if assumption (9.29) is true, then $e_1$ must be correlated with the unobserved
variable $x_1^*$. To determine the properties of OLS in this case, we write $x_1^* = x_1 - e_1$ and plug
this into equation (9.27):


$y = \beta_0 + \beta_1 x_1 + (u - \beta_1 e_1)$. [9.30]


Because we have assumed that u and $e_1$ both have zero mean and are uncorrelated with $x_1$, $u - \beta_1 e_1$
has zero mean and is uncorrelated with $x_1$. It follows that OLS estimation with $x_1$ in place of $x_1^*$
produces a consistent estimator of $\beta_1$ (and also $\beta_0$). Because u is uncorrelated with $e_1$, the variance
of the error in (9.30) is $\mathrm{Var}(u - \beta_1 e_1) = \sigma_u^2 + \beta_1^2 \sigma_{e_1}^2$. Thus, except when $\beta_1 = 0$, measurement error
increases the error variance. But this does not affect any of the OLS properties (except that the variances
of the $\hat{\beta}_j$ will be larger than if we observe $x_1^*$ directly).

The assumption that $e_1$ is uncorrelated with $x_1$ is analogous to the proxy variable assumption we
made in Section 9-2. Because this assumption implies that OLS has all of its nice properties, this is
not usually what econometricians have in mind when they refer to measurement error in an explanatory
variable. The classical errors-in-variables (CEV) assumption is that the measurement error is
uncorrelated with the unobserved explanatory variable:


$\mathrm{Cov}(x_1^*, e_1) = 0$. [9.31]


This assumption comes from writing the observed measure as the sum of the true explanatory variable
and the measurement error,


$x_1 = x_1^* + e_1$,


and then assuming the two components of $x_1$ are uncorrelated. (This has nothing to do with assumptions
about u; we always maintain that u is uncorrelated with $x_1^*$ and $x_1$, and therefore with $e_1$.)
If assumption (9.31) holds, then $x_1$ and $e_1$ must be correlated:


$\mathrm{Cov}(x_1, e_1) = E(x_1 e_1) = E(x_1^* e_1) + E(e_1^2) = 0 + \sigma_{e_1}^2 = \sigma_{e_1}^2$. [9.32]


Thus, the covariance between $x_1$ and $e_1$ is equal to the variance of the measurement error under the
CEV assumption.

Referring to equation (9.30), we can see that correlation between $x_1$ and $e_1$ is going to cause
problems. Because u and $x_1$ are uncorrelated, the covariance between $x_1$ and the composite error
$u - \beta_1 e_1$ is


$\mathrm{Cov}(x_1, u - \beta_1 e_1) = -\beta_1 \mathrm{Cov}(x_1, e_1) = -\beta_1 \sigma_{e_1}^2$.


Thus, in the CEV case, the OLS regression of y on $x_1$ gives a biased and inconsistent estimator.
Using the asymptotic results in Chapter 5, we can determine the amount of inconsistency in OLS. The
probability limit of $\hat{\beta}_1$ is $\beta_1$ plus the ratio of the covariance between $x_1$ and $u - \beta_1 e_1$ and the variance of $x_1$:


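$$\begin{aligned}
\mathrm{plim}(\hat{\beta}_1) &= \beta_1 + \frac{\mathrm{Cov}(x_1, u - \beta_1 e_1)}{\mathrm{Var}(x_1)} \\
&= \beta_1 - \frac{\beta_1 \sigma_{e_1}^2}{\sigma_{x_1^*}^2 + \sigma_{e_1}^2} \\
&= \beta_1 \left( 1 - \frac{\sigma_{e_1}^2}{\sigma_{x_1^*}^2 + \sigma_{e_1}^2} \right) \\
&= \beta_1 \left( \frac{\sigma_{x_1^*}^2}{\sigma_{x_1^*}^2 + \sigma_{e_1}^2} \right),
\end{aligned}$$ [9.33]


where we have used the fact that $\mathrm{Var}(x_1) = \mathrm{Var}(x_1^*) + \mathrm{Var}(e_1)$.

Equation (9.33) is very interesting. The term multiplying $\beta_1$, which is the ratio $\mathrm{Var}(x_1^*)/\mathrm{Var}(x_1)$,
is always less than one [an implication of the CEV assumption (9.31)]. Thus, $\mathrm{plim}(\hat{\beta}_1)$ is always
closer to zero than is $\beta_1$. This is called the attenuation bias in OLS due to CEV: on average (or in
large samples), the estimated OLS effect will be attenuated. In particular, if $\beta_1$ is positive, $\hat{\beta}_1$ will tend
to underestimate $\beta_1$. This is an important conclusion, but it relies on the CEV setup.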

If the variance of $x_1^*$ is large relative to the variance in the measurement error, then the inconsistency
in OLS will be small. This is because $\mathrm{Var}(x_1^*)/\mathrm{Var}(x_1)$ will be close to unity when $\sigma_{x_1^*}^2/\sigma_{e_1}^2$ is
large. Therefore, depending on how much variation there is in $x_1^*$ relative to $e_1$, measurement error
need not cause large biases.
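
The attenuation factor in (9.33) is easy to see in simulated data. The following sketch is an illustration under assumed values: a true slope of 2, a standard deviation of 1 for the true regressor, and a standard deviation of 0.5 for the measurement error are choices made here, as is the helper name `ols_slope`. The measurement error is generated independently of the true regressor, so the CEV assumption (9.31) holds by construction.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
beta0, beta1 = 1.0, 2.0            # assumed population parameters
sigma_xstar, sigma_e = 1.0, 0.5    # assumed sd of x1* and of the measurement error

x_star = rng.normal(scale=sigma_xstar, size=n)
e1 = rng.normal(scale=sigma_e, size=n)  # CEV: generated independently of x_star
x1 = x_star + e1                        # observed, mismeasured regressor
y = beta0 + beta1 * x_star + rng.normal(size=n)


def ols_slope(x, y):
    """Slope from a simple regression of y on x (intercept included)."""
    xd = x - x.mean()
    return (xd @ (y - y.mean())) / (xd @ xd)


b_true = ols_slope(x_star, y)  # infeasible regression on the true variable
b_obs = ols_slope(x1, y)       # feasible regression on the mismeasured variable

# Probability limit from (9.33): beta1 * Var(x1*) / [Var(x1*) + Var(e1)]
predicted_plim = beta1 * sigma_xstar**2 / (sigma_xstar**2 + sigma_e**2)

print("slope using x1*:", round(b_true, 3))
print("slope using x1: ", round(b_obs, 3))
print("plim from (9.33):", predicted_plim)
```

Raising `sigma_xstar` relative to `sigma_e` in this sketch moves the attenuation factor toward one, in line with the remark that measurement error need not cause large biases.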

Things are more complicated when we add more explanatory variables. For illustration, consider 
the model 


$y = \beta_0 + \beta_1 x_1^* + \beta_2 x_2 + \beta_3 x_3 + u$, [9.34]


where the first of the three explanatory variables is measured with error. We make the natural assumption
that u is uncorrelated with $x_1^*$, $x_2$, $x_3$, and $x_1$. Again, the crucial assumption concerns the measurement
error $e_1$. In almost all cases, $e_1$ is assumed to be uncorrelated with $x_2$ and $x_3$, the explanatory
variables not measured with error. The key issue is whether $e_1$ is uncorrelated with $x_1$. If it is, then the
OLS regression of y on $x_1$, $x_2$, and $x_3$ produces consistent estimators. This is easily seen by writing


$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + u - \beta_1 e_1$, [9.35]


where u and $e_1$ are both uncorrelated with all the explanatory variables.

Under the CEV assumption in (9.31), OLS will be biased and inconsistent, because $e_1$ is correlated
with $x_1$ in equation (9.35). Remember, this means that, in general, all OLS estimators will be
biased, not just $\hat{\beta}_1$. What about the attenuation bias derived in equation (9.33)? It turns out that there is
still an attenuation bias for estimating $\beta_1$: it can be shown that


$\mathrm{plim}(\hat{\beta}_1) = \beta_1 \left( \frac{\sigma_{r_1^*}^2}{\sigma_{r_1^*}^2 + \sigma_{e_1}^2} \right)$, [9.36]


where $r_1^*$ is the population error in the equation $x_1^* = \alpha_0 + \alpha_1 x_2 + \alpha_2 x_3 + r_1^*$. Formula (9.36) also
works in the general k variable case when $x_1$ is the only mismeasured variable.

Things are less clear-cut for estimating the $\beta_j$ on the variables not measured with error. In the
special case that $x_1^*$ is uncorrelated with $x_2$ and $x_3$, $\hat{\beta}_2$ and $\hat{\beta}_3$ are consistent. But this is rare in practice.
Generally, measurement error in a single variable causes inconsistency in all estimators.
Unfortunately, the sizes, and even the directions of the biases, are not easily derived.
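
A simulated version of model (9.34) can illustrate both points: the attenuation in (9.36) and the spillover onto the other coefficients. The sketch below is illustrative only. All coefficient values and variances are assumptions chosen here so that the variance of $r_1^*$ equals the variance of $e_1$, which makes the attenuation factor in (9.36) equal to one-half.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 400_000
b = np.array([1.0, 2.0, -1.0, 0.5])  # assumed beta0, beta1, beta2, beta3

x2 = rng.normal(size=n)
x3 = rng.normal(size=n)
r1 = rng.normal(scale=0.6, size=n)          # r1*: part of x1* not explained by x2, x3
x1_star = 0.2 + 0.7 * x2 - 0.4 * x3 + r1    # true regressor, correlated with x2 and x3
e1 = rng.normal(scale=0.6, size=n)          # CEV measurement error
x1 = x1_star + e1                           # observed regressor
y = b[0] + b[1] * x1_star + b[2] * x2 + b[3] * x3 + rng.normal(size=n)

# OLS of y on a constant, the mismeasured x1, x2, and x3.
X = np.column_stack([np.ones(n), x1, x2, x3])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# Probability limit from (9.36): beta1 * Var(r1*) / [Var(r1*) + Var(e1)]
predicted = b[1] * 0.6**2 / (0.6**2 + 0.6**2)

print("estimate of beta1:", round(coef[1], 3), " plim from (9.36):", predicted)
print("estimate of beta2:", round(coef[2], 3), " true beta2:", b[2])
print("estimate of beta3:", round(coef[3], 3), " true beta3:", b[3])
```

Note that the relevant variance in (9.36) is that of $r_1^*$, not of $x_1^*$ itself, so the attenuation here is more severe than the simple regression formula (9.33) would suggest. Because $x_1^*$ is correlated with $x_2$ and $x_3$ in this design, the estimates on those variables should also miss their population values.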


EXAMPLE 9.7 GPA Equation with Measurement Error


Consider the problem of estimating the effect of family income on college grade point average, after 
controlling for hsGPA (high school grade point average) and SAT (scholastic aptitude test). It could 
be that, though family income is important for performance before college, it has no direct effect on 
college performance. To test this, we might postulate the model 


$colGPA = \beta_0 + \beta_1 faminc^* + \beta_2 hsGPA + \beta_3 SAT + u$,


where faminc* is actual annual family income. (This might appear in logarithmic form, but for the
sake of illustration we leave it in level form.) Precise data on colGPA, hsGPA, and SAT are relatively
easy to obtain. But family income, especially as reported by students, could be easily mismeasured. If
$faminc = faminc^* + e_1$ and the CEV assumptions hold, then using reported family income in place
of actual family income will bias the OLS estimator of $\beta_1$ toward zero. One consequence of the downward
bias is that a test of $H_0$: $\beta_1 = 0$ will have less chance of detecting $\beta_1 > 0$.


Of course, measurement error can be present in more than one explanatory variable, or in some 
explanatory variables and the dependent variable. As we discussed earlier, any measurement error 
in the dependent variable is usually assumed to be uncorrelated with all the explanatory variables, 
whether it is observed or not. Deriving the bias in the OLS estimators under extensions of the CEV 
assumptions is complicated and does not lead to clear results.

In some cases, it is clear that the CEV assumption in (9.31) cannot be true. Consider a variant on
Example 9.7: 


$colGPA = \beta_0 + \beta_1 smoked^* + \beta_2 hsGPA + \beta_3 SAT + u$,


where smoked* is the actual number of times a student smoked marijuana in the last 30 days. The
variable smoked is the answer to this question: On how many separate occasions did you smoke marijuana
in the last 30 days? Suppose we postulate the standard measurement error model


$smoked = smoked^* + e_1$.


Even if we assume that students try to report the truth, the CEV assumption is unlikely to hold. People
who do not smoke marijuana at all, so that smoked* = 0, are likely to report smoked = 0, so the
measurement error is probably zero for students who never smoke marijuana. When smoked* > 0,
it is much more likely that the student miscounts how many times he or she smoked marijuana in
the last 30 days. This means that the measurement error $e_1$ and the actual number of times smoked,
smoked*, are correlated, which violates the CEV assumption in (9.31). Unfortunately, deriving the
implications of measurement error that do not satisfy (9.29) or (9.31) is difficult and beyond the scope of
this text.

Before leaving this section, we emphasize that the CEV assumption (9.31), while more believable
than assumption (9.29), is still a strong assumption. The truth is probably somewhere in between, and if
$e_1$ is correlated with both $x_1^*$ and $x_1$, OLS is inconsistent. This raises an important question: must we live
with inconsistent estimators under CEV, or other kinds of measurement error that are correlated with
$x_1$? Fortunately, the answer is no. Chapter 15 shows how, under certain assumptions, the parameters
can be consistently estimated in the presence of general measurement error. We postpone this discussion
until later because it requires us to leave the realm of OLS estimation. (See Problem 7 for how
multiple measures can be used to reduce the attenuation bias.)


GOING FURTHER 9.3

Let educ* be actual amount of schooling, measured in years (which can be a noninteger),
and let educ be reported highest grade completed. Do you think educ and educ*
are related by the CEV model?


9-5 Missing Data, Nonrandom Samples, and Outlying Observations 


The measurement error problem discussed in the previous section can be viewed as a data problem: we 
cannot obtain data on the variables of interest. Further, under the CEV model, the composite error term 
is correlated with the mismeasured independent variable, violating the Gauss-Markov assumptions. 

Another data problem we discussed frequently in earlier chapters is multicollinearity among the 
explanatory variables. Remember that correlation among the explanatory variables does not violate 
any assumptions. When two independent variables are highly correlated, it can be difficult to estimate 
the partial effect of each. But this is properly reflected in the usual OLS statistics. 

In this section, we provide an introduction to data problems that can violate the random sampling 
assumption, MLR.2. We can isolate cases in which nonrandom sampling has no practical effect on 
OLS. In other cases, nonrandom sampling causes the OLS estimators to be biased and inconsistent. 
A more complete treatment that establishes several of the claims made here is given in Chapter 17. 


9-5a Missing Data 


The missing data problem can arise in a variety of forms. Often, we collect a random sample of 
people, schools, cities, and so on, and then discover later that information is missing on some key 
variables for several units in the sample. For example, in the data set BWGHT, 196 of the 1,388
observations have no information on father’s education. In the data set on median starting law school
salaries, LAWSCH85, 6 of the 156 schools have no reported information on median LSAT scores for 
the entering class; other variables are also missing for some of the law schools. 

If data are missing for an observation on either the dependent variable or one of the independ- 
ent variables, then the observation cannot be used in a standard multiple regression analysis. In fact, 
provided missing data have been properly indicated, all modern regression packages keep track of 
missing data and simply ignore observations when computing a regression. We saw this explicitly in 
the birth weight scenario in Example 4.9, when 197 observations were dropped due to missing infor- 
mation on parents’ education. 

In the literature on missing data, an estimator that uses only observations with a complete set of 
data on y and \(x_1, \ldots, x_k\) is called a complete cases estimator; as mentioned earlier, this estimator is 
computed as the default for OLS (and all estimators covered later in the text). Other than reducing the 
sample size, are there any statistical consequences of using the OLS estimator and ignoring the miss- 
ing data? If, in the language of the missing data literature [see, for example, Little and Rubin (2002, 
Chapter 1)] the data are missing completely at random (sometimes called MCAR), then missing 
data cause no statistical problems. The MCAR assumption implies that the reason the data are miss- 
ing is independent, in a statistical sense, of both the observed and unobserved factors affecting y. In 
effect, we can still assume that the data have been obtained by random sampling from the population, 
so that Assumption MLR.2 continues to hold. 

When MCAR holds, there are ways to use partial information obtained from units that are 
dropped from the complete case estimation. Unfortunately, some simple strategies produce consistent 
estimators only under strong assumptions (in addition to MCAR). As an illustration, suppose that for a 
multiple regression model data are always available for y and \(x_1, x_2, \ldots, x_{k-1}\) but are sometimes 
missing for the explanatory variable \(x_k\). A common “solution” is to create two new variables. For a unit i, 
the first variable, say \(Z_{ik}\), is defined to be \(x_{ik}\) when \(x_{ik}\) is observed, and zero otherwise. The second 
variable is a “missing data indicator,” say \(m_{ik}\), which equals one when \(x_{ik}\) is missing and equals zero when 
\(x_{ik}\) is observed. Having defined these two variables, all of the units are used in the regression of 

\(y_i\) on \(x_{i1}, x_{i2}, \ldots, x_{i,k-1}, Z_{ik}, m_{ik}\), \(i = 1, \ldots, n\). 

It is easy to understand the appeal of this procedure, which we call the missing indicator method 
(MIM). Suppose that the original sample size is n = 1,000 but \(x_k\) is missing for 30% of the cases. The 
complete cases estimator would use only 700 observations while the MIM regression would be based 
on all 1,000 cases. Unfortunately, the gain in observations is largely illusory, as the MIM estimator 
only has good statistical properties under strong assumptions. In particular, in addition to MCAR, 
consistency essentially requires that \(x_k\) is uncorrelated with the other explanatory variables, 
\(x_1, x_2, \ldots, x_{k-1}\), as discussed in Jones (1996) and expanded on in Abrevaya and Donald (2018). Of course, 
the bias and inconsistency in MIM may not be practically important in a given application, but we 
generally have no way of knowing. One thing we can be sure of is that it is a very poor idea to omit \(m_{ik}\) from 
the regression, as that is the same as setting \(x_k\) equal to zero whenever it is missing. Problem 9.10 
works through how MCAR is sufficient for consistency in the simple regression model. The reader is 
referred to Abrevaya and Donald (2018) to see why MCAR is not sufficient when other regressors are 
included. In addition, Abrevaya and Donald (2018) discuss more robust ways to include information 
when some variables have missing data. The methods are too advanced for the scope of this text. 
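
The construction of the two MIM regressors can be sketched in a few lines of Python. This is only an illustration of the definitions above: the function name missing_indicator_vars and the small data list are hypothetical, and missing values are assumed to be coded as None.

```python
from typing import List, Optional, Tuple


def missing_indicator_vars(xk: List[Optional[float]]) -> Tuple[List[float], List[int]]:
    """Build the two MIM regressors from a variable with missing entries (None).

    z[i] equals xk[i] when it is observed and 0.0 otherwise;
    m[i] equals 1 when xk[i] is missing and 0 when it is observed.
    """
    z = [0.0 if v is None else float(v) for v in xk]
    m = [1 if v is None else 0 for v in xk]
    return z, m


# Hypothetical data: the third and fifth units are missing x_k.
xk = [12.0, 16.0, None, 14.0, None]
z, m = missing_indicator_vars(xk)
print(z)  # [12.0, 16.0, 0.0, 14.0, 0.0]
print(m)  # [0, 0, 1, 0, 1]
```

Dropping m from the regression while keeping z would treat the zeros in z as genuine data, which is the poor practice warned against above.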

An important consequence of the previous discussion is that MIM is substantially less robust 
than the complete cases estimator in the sense that the MIM approach requires much stronger assumptions 
for consistency. As we will see in the next subsection, the complete cases estimator turns out to 
be consistent even when the reason the data are missing is a function of \((x_1, x_2, \ldots, x_k)\), something 
explicitly ruled out by MCAR. Plus, the complete cases estimator puts no restrictions on the correlations 
among \((x_1, x_2, \ldots, x_k)\). 

There are more complicated schemes for using partial information that are based on filling in the 
missing data, but these are beyond the scope of this text. The reader is referred to Little and Rubin 
(2002) and Abrevaya and Donald (2018). 


9-5b Nonrandom Samples 


The MCAR assumption ensures that units for which we observe a full set of data are not system- 
atically different from units for which some variables are missing. Unfortunately, MCAR is often 
unrealistic. An example of a missing data mechanism that does not satisfy MCAR can be obtained by 
looking at the data set CARD, where the measure of IQ is missing for 949 men. If the probability that 
the IQ score is missing is, say, higher for men with lower IQ scores, the mechanism violates MCAR. 
For example, in the birth weight data set, what if the probability that education is missing is higher for 
those people with lower than average levels of education? Or, in Section 9-2, we used a wage data set 
that included IQ scores. This data set was constructed by omitting several people from the sample for 
whom IQ scores were not available. If obtaining an IQ score is easier for those with higher IQs, the 
sample is not representative of the population. The random sampling assumption MLR.2 is violated, 
and we must worry about these consequences for OLS estimation. 

Fortunately, certain types of nonrandom sampling do not cause bias or inconsistency in OLS. 
Under the Gauss-Markov assumptions (but without MLR.2), it turns out that the sample can be cho- 
sen on the basis of the independent variables without causing any statistical problems. This is called 
sample selection based on the independent variables, and it is an example of exogenous sample 
selection. 

In the statistics literature, exogenous sample selection due to missing data is often called missing 
at random (MAR), which is not a particularly good label because the probability of missing data 
is allowed to depend on the explanatory variables. The word “random” would seem to connote that 
missingness cannot depend systematically on anything, but that is actually the intention of the phrase 
“completely at random.” In other words, MAR requires that missingness is unrelated to u but allows 
it to depend on \((x_1, x_2, \ldots, x_k)\), whereas MCAR means the missingness is unrelated to \((x_1, x_2, \ldots, x_k)\) 
and u. See Little and Rubin (2002, Chapter 1) for further discussion. 

To illustrate exogenously missing data, suppose that we are estimating a saving function, where 
annual saving depends on income, age, family size, and some unobserved factors, u. A simple model is 


\[ \mathit{saving} = \beta_0 + \beta_1\,\mathit{income} + \beta_2\,\mathit{age} + \beta_3\,\mathit{size} + u. \qquad [9.37] \]


Suppose that our data set was based on a survey of people over 35 years of age, thereby leaving us 
with a nonrandom sample of all adults. While this is not ideal, we can still get unbiased and consist- 
ent estimators of the parameters in the population model (9.37), using the nonrandom sample. We 
will not show this formally here, but the reason OLS on the nonrandom sample is unbiased is that the 
regression function E(saving|income,age,size) is the same for any subset of the population described 
by income, age, or size. Provided there is enough variation in the independent variables in the sub- 
population, selection on the basis of the independent variables is not a serious problem, other than 
that it results in smaller sample sizes. 
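
A small simulation can make the claim about selection on an explanatory variable concrete. The sketch below is not based on any data set in the text: the population model (saving = 100 + 50 age + u), the age range, the error standard deviation, and the seed are all hypothetical choices, and only one regressor is used so that the OLS slope has a simple closed form.

```python
import random


def ols_simple(x, y):
    """Return (intercept, slope) from a simple OLS regression of y on x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sxy / sxx
    return ybar - slope * xbar, slope


rng = random.Random(0)  # fixed seed so the run is reproducible
n = 20000
# Hypothetical population: saving = 100 + 50*age + u, with u independent of age.
age = [rng.uniform(18, 80) for _ in range(n)]
saving = [100 + 50 * a + rng.gauss(0, 500) for a in age]

_, slope_full = ols_simple(age, saving)

# Exogenous selection: keep only people over 35 (a rule based on x alone).
kept = [(a, s) for a, s in zip(age, saving) if a > 35]
_, slope_over35 = ols_simple([a for a, _ in kept], [s for _, s in kept])

print(round(slope_full, 2), round(slope_over35, 2))  # both should be near 50
```

Under these assumptions the two slope estimates differ only by sampling error; the selected sample is smaller, so its estimate is somewhat less precise.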

In the IQ example just mentioned, things are not so clear-cut, because no fixed rule based on IQ 
is used to include someone in the sample. Rather, the probability of being in the sample increases 
with IQ. If the other factors determining selection into the sample are independent of the error term 
in the wage equation, then we have another case of exogenous sample selection, and OLS using the 
selected sample will have all of its desirable properties under the other Gauss-Markov assumptions. 

The situation is much different when selection is based on the dependent variable, y, which is 
called sample selection based on the dependent variable and is an example of endogenous sample 
selection. If the sample is based on whether the dependent variable is above or below a given value, 
bias always occurs in OLS in estimating the population model. For example, suppose we wish to 
estimate the relationship between individual wealth and several other factors in the population of 
all adults: 


\[ \mathit{wealth} = \beta_0 + \beta_1\,\mathit{educ} + \beta_2\,\mathit{exper} + \beta_3\,\mathit{age} + u. \qquad [9.38] \]


Suppose that only people with wealth below $250,000 are included in the sample. This is a nonrandom 
sample from the population of interest, and it is based on the value of the dependent variable. Using a 
sample of people with wealth below $250,000 will result in biased and inconsistent estimators of the 
parameters in (9.38). Briefly, this occurs because the population regression E(wealth|educ,exper,age) 
is not the same as the expected value conditional on wealth being less than $250,000. 


GOING FURTHER 9.4 


Suppose we are interested in the effects of 
campaign expenditures by incumbents on 
voter support. Some incumbents choose 
not to run for reelection. If we can only 
collect voting and spending outcomes on 
incumbents that actually do run, is there 
likely to be endogenous sample selection? 
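
The bias from selecting on the dependent variable can be illustrated with a short simulation. Everything in the sketch is hypothetical: y plays the role of wealth, the population model is y = 1 + 2x + u with standard normal x and u, and the cutoff of 1.0 stands in for the $250,000 threshold.

```python
import random


def ols_slope(x, y):
    """Slope from a simple OLS regression of y on x (with an intercept)."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return sxy / sxx


rng = random.Random(1)  # fixed seed so the run is reproducible
n = 20000
# Hypothetical population model: y = 1 + 2*x + u, with u independent of x.
x = [rng.gauss(0, 1) for _ in range(n)]
y = [1 + 2 * xi + rng.gauss(0, 1) for xi in x]

slope_full = ols_slope(x, y)

# Endogenous selection: keep only units whose y falls below the cutoff.
cutoff = 1.0
kept = [(xi, yi) for xi, yi in zip(x, y) if yi < cutoff]
slope_truncated = ols_slope([xi for xi, _ in kept], [yi for _, yi in kept])

print(round(slope_full, 2))       # close to the population slope of 2
print(round(slope_truncated, 2))  # noticeably below 2
```

In this setup the slope from the truncated sample is pulled toward zero, and collecting more observations under the same selection rule does not remove the gap.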

Other sampling schemes lead to nonrandom samples from the population, usually intentionally. 
A common method of data collection is stratified sampling, in which the population is divided into 
nonoverlapping, exhaustive groups, or strata. Then, some groups are sampled more frequently than is 
dictated by their population representation, and some groups are sampled less frequently. For exam- 
ple, some surveys purposely oversample minority groups or low-income groups. Whether special 
methods are needed again hinges on whether the stratification is exogenous (based on exogenous 
explanatory variables) or endogenous (based on the dependent variable). Suppose that a survey of 
military personnel oversampled women because the initial interest was in studying the factors that 
determine pay for women in the military. (Oversampling a group that is relatively small in the popu- 
lation is common in collecting stratified samples.) Provided men were sampled as well, we can use 
OLS on the stratified sample to estimate any gender differential, along with the returns to education 
and experience for all military personnel. (We might be willing to assume that the returns to education 
and experience are not gender specific.) OLS is unbiased and consistent because the stratification is 
with respect to an explanatory variable, namely, gender. 

If, instead, the survey oversampled lower-paid military personnel, then OLS using the strati- 
fied sample does not consistently estimate the parameters of the military wage equation because the 
stratification is endogenous. In such cases, special econometric methods are needed [see Wooldridge 
(2010, Chapter 19)]. 

Stratified sampling is a fairly obvious form of nonrandom sampling. Other sample selection 
issues are more subtle. For instance, in several previous examples, we have estimated the effects 
of various variables, particularly education and experience, on hourly wage. The data set WAGE1 
that we have used throughout is essentially a random sample of working individuals. Labor econo- 
mists are often interested in estimating the effect of, say, education on the wage offer. The idea 
is this: every person of working age faces an hourly wage offer, and he or she can either work at 
that wage or not work. For someone who does work, the wage offer is just the wage earned. For 
people who do not work, we usually cannot observe the wage offer. Now, because the wage offer 
equation 


\[ \log(\mathit{wage}^o) = \beta_0 + \beta_1\,\mathit{educ} + \beta_2\,\mathit{exper} + u \qquad [9.39] \]


represents the population of all working-age people, we cannot estimate it using a random sample 
from this population; instead, we have data on the wage offer only for working people (although 
we can get data on educ and exper for nonworking people). If we use 
a random sample of working people to estimate (9.39), will we get 
unbiased estimators? This case is not clear-cut. Because the sample 
is selected based on someone’s decision to work (as opposed to the 
size of the wage offer), this is not like the previous case. However, 
because the decision to work might be related to unobserved factors 
that affect the wage offer, selection might be endogenous, and this can 
result in a sample selection bias in the OLS estimators. We will cover 
methods that can be used to test and correct for sample selection bias 
in Chapter 17. 


9-5c Outliers and Influential Observations 


In some applications, especially, but not only, with small data sets, the OLS estimates are sensitive to the 
inclusion of one or several observations. A complete treatment of outliers and influential observations is 
beyond the scope of this book, because a formal development requires matrix algebra. Loosely speaking, 
an observation is an influential observation if dropping it from the analysis changes the key OLS estimates 
by a practically “large” amount. The notion of an outlier is also a bit vague, because it requires comparing 
values of the variables for one observation with those for the remaining sample. Nevertheless, one wants 
to be on the lookout for “unusual” observations because they can greatly affect the OLS estimates. 

OLS is susceptible to outlying observations because it minimizes the sum of squared residuals: 
large residuals (positive or negative) receive a lot of weight in the least squares minimization prob- 
lem. If the estimates change by a practically large amount when we slightly modify our sample, we 
should be concerned. 

When statisticians and econometricians study the problem of outliers theoretically, sometimes the 
data are viewed as being from a random sample from a given population—albeit with an unusual dis- 
tribution that can result in extreme values—and sometimes the outliers are assumed to come from a 
different population. From a practical perspective, outlying observations can occur for two reasons. 
The easiest case to deal with is when a mistake has been made in entering the data. Adding extra zeros 
to a number or misplacing a decimal point can throw off the OLS estimates, especially in small sample 
sizes. It is always a good idea to compute summary statistics, especially minimums and maximums, in 
order to catch mistakes in data entry. Unfortunately, incorrect entries are not always obvious. 

Outliers can also arise when sampling from a small population if one or several members of the 
population are very different in some relevant aspect from the rest of the population. The decision 
to keep or drop such observations in a regression analysis can be a difficult one, and the statistical 
properties of the resulting estimators are complicated. Outlying observations can provide important 
information by increasing the variation in the explanatory variables (which reduces standard errors). 
But OLS results should probably be reported with and without outlying observations in cases where 
one or several data points substantially change the results. 
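
Reporting results with and without particular observations is easy to automate. The sketch below uses a made-up data set with six observations, the last of which has a much larger x than the rest, and recomputes the simple regression slope after dropping each observation in turn. The helper names are hypothetical.

```python
def ols_slope(x, y):
    """Slope from a simple OLS regression of y on x (with an intercept)."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return sxy / sxx


def slopes_dropping_each(x, y):
    """Slope estimates obtained by dropping one observation at a time."""
    return [ols_slope(x[:i] + x[i + 1:], y[:i] + y[i + 1:]) for i in range(len(x))]


# Hypothetical data: the last unit has a much larger x than the others.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 20.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]

slope_all = ols_slope(x, y)
slopes_without = slopes_dropping_each(x, y)
for i, b in enumerate(slopes_without):
    print(f"drop obs {i}: slope = {b:.3f} (all obs: {slope_all:.3f})")
```

With these numbers the slope is about 0.21 using all six observations and about 1.0 once the last observation is dropped, while dropping any other single observation changes the estimate far less.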


EXAMPLE 9.8 R&D Intensity and Firm Size 


Suppose that R&D expenditures as a percentage of sales (rdintens) are related to sales (in millions) 
and profits as a percentage of sales (profmarg): 


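\[ \mathit{rdintens} = \beta_0 + \beta_1\,\mathit{sales} + \beta_2\,\mathit{profmarg} + u. \qquad [9.40] \]

The OLS equation using data on 32 chemical companies in RDCHEM is 


\[ \widehat{\mathit{rdintens}} = \underset{(0.586)}{2.625} + \underset{(.000044)}{.000053}\,\mathit{sales} + \underset{(.0462)}{.0446}\,\mathit{profmarg} \]
\[ n = 32,\ R^2 = .0761,\ \bar{R}^2 = .0124. \]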


Neither sales nor profmarg is statistically significant at even the 10% level in this regression. 

Of the 32 firms, 31 have annual sales less than $20 billion. One firm has annual sales of almost 
$40 billion. Figure 9.1 shows how far this firm is from the rest of the sample. In terms of sales, this 
firm is over twice as large as every other firm, so it might be a good idea to estimate the model with- 
out it. When we do this, we obtain 


\[ \widehat{\mathit{rdintens}} = \underset{(0.592)}{2.297} + \underset{(.000084)}{.000186}\,\mathit{sales} + \underset{(.0445)}{.0478}\,\mathit{profmarg} \]
\[ n = 31,\ R^2 = .1728,\ \bar{R}^2 = .1137. \]


FIGURE 9.1 Scatterplot of R&D intensity against firm sales. 


When the largest firm is dropped from the regression, the coefficient on sales more than triples, 
and it now has a t statistic over two. Using the sample of smaller firms, we would conclude that there 
is a statistically significant positive relationship between R&D intensity and firm size. The profit margin is 
still not significant, and its coefficient has not changed by much. 


Sometimes, outliers are defined by the size of the residual in an OLS regression, where all of the 
observations are used. Generally, this is not a good idea because the OLS estimates adjust to make the 
sum of squared residuals as small as possible. In the previous example, including the largest firm flat- 
tened the OLS regression line considerably, which made the residual for that estimation not especially 
large. In fact, the residual for the largest firm is -1.62 when all 32 observations are used. This value of 
the residual is not even one estimated standard deviation, \(\hat{\sigma} = 1.82\), from the mean of the residuals, 
which is zero by construction. 

Studentized residuals are obtained from the original OLS residuals by dividing them by an esti- 
mate of their standard deviation (conditional on the explanatory variables in the sample). The formula 
for the studentized residuals relies on matrix algebra, but it turns out there is a simple trick to compute 
a studentized residual for any observation. Namely, define a dummy variable equal to one for that 
observation—say, observation h—and then include the dummy variable in the regression (using all 
observations) along with the other explanatory variables. The coefficient on the dummy variable has a 
useful interpretation: it is the residual for observation h computed from the regression line using only 
the other observations. Therefore, the dummy’s coefficient can be used to see how far off the observation 
is from the regression line obtained without using that observation. Even better, the t statistic on 
the dummy variable is equal to the studentized residual for observation h. Under the classical linear 
model assumptions, this t statistic has a \(t_{n-k-2}\) distribution. Therefore, a large value of the t statistic (in 
absolute value) implies a large residual relative to its estimated standard deviation. 
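
The dummy variable trick can be checked numerically. The following sketch uses made-up data with one regressor and a small hand-written OLS routine (the functions invert and ols are illustrative helpers, not part of any package). It compares the coefficient on the dummy for observation h with the residual for h computed from a regression that leaves h out.

```python
def invert(a):
    """Gauss-Jordan inverse of a small square matrix with partial pivoting."""
    p = len(a)
    aug = [list(map(float, row)) + [1.0 if i == j else 0.0 for j in range(p)]
           for i, row in enumerate(a)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        piv = aug[col][col]
        aug[col] = [v / piv for v in aug[col]]
        for r in range(p):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[p:] for row in aug]


def ols(X, y):
    """Return (coefficients, standard errors) for OLS of y on the columns of X."""
    n, p = len(X), len(X[0])
    xtx = [[sum(r[a] * r[b] for r in X) for b in range(p)] for a in range(p)]
    xty = [sum(r[a] * yi for r, yi in zip(X, y)) for a in range(p)]
    inv = invert(xtx)
    beta = [sum(inv[a][b] * xty[b] for b in range(p)) for a in range(p)]
    resid = [yi - sum(c * v for c, v in zip(beta, r)) for r, yi in zip(X, y)]
    sigma2 = sum(e * e for e in resid) / (n - p)
    se = [(sigma2 * inv[a][a]) ** 0.5 for a in range(p)]
    return beta, se


# Hypothetical data; the last observation is the one being examined.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 20.0]
y = [1.2, 1.9, 3.1, 4.2, 4.8, 6.1, 7.0, 8.0]
h = 7

# Regression on all observations, adding a dummy equal to one for observation h.
X_dummy = [[1.0, xi, 1.0 if i == h else 0.0] for i, xi in enumerate(x)]
beta_d, se_d = ols(X_dummy, y)
dummy_coef = beta_d[2]
studentized = dummy_coef / se_d[2]  # t statistic on the dummy

# Regression without observation h, then the residual for h from that line.
X_rest = [[1.0, xi] for i, xi in enumerate(x) if i != h]
y_rest = [yi for i, yi in enumerate(y) if i != h]
beta_r, _ = ols(X_rest, y_rest)
loo_residual = y[h] - (beta_r[0] + beta_r[1] * x[h])

print(round(dummy_coef, 4), round(loo_residual, 4))  # the two numbers agree
print(round(studentized, 2))
```

The normal equations are solved directly here only to keep the example self-contained; any regression package would give the same coefficient and t statistic on the dummy.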

For Example 9.8, if we define a dummy variable for the largest firm (observation 10 in the data 
file), and include it as an additional regressor, its coefficient is -6.57, verifying that the observation 
for the largest firm is very far from the regression line obtained using the other observations. 
However, when studentized, the residual is only -1.82. While this is a marginally significant t statistic 
(two-sided p-value = .08), it is not close to being the largest studentized residual in the sample. If 
we use the same method for the observation with the highest value of rdintens (the first observation, 
with rdintens = 9.42), the coefficient on the dummy variable is 6.72 with a t statistic of 4.56. 
Therefore, by this measure, the first observation is more of an outlier than the tenth. Yet dropping 
the first observation changes the coefficient on sales by only a small amount (to about .000051 from 
.000053), although the coefficient on profmarg becomes larger and statistically significant. So, is the 
first observation an “outlier” too? These calculations show the conundrum one can enter when trying 
to determine observations that should be excluded from a regression analysis, even when the data set 
is small. Unfortunately, the size of the studentized residual need not correspond to how influential an 
observation is for the OLS slope estimates, and certainly not for all of them at once. 

A general problem with using studentized residuals is that, in effect, all other observations are 
used to estimate the regression line to compute the residual for a particular observation. In other 
words, when the studentized residual is obtained for the first observation, the tenth observation has 
been used in estimating the intercept and slope. Given how flat the regression line is with the largest 
firm (tenth observation) included, it is not too surprising that the first observation, with its high value 
of rdintens, is far off the regression line. 

Of course, we can add two dummy variables at the same time—one for the first observation 
and one for the tenth—which has the effect of using only the remaining 30 observations to 
estimate the regression line. If we estimate the equation without the first and tenth observations, the 
results are 


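\[ \widehat{\mathit{rdintens}} = \underset{(0.459)}{1.939} + \underset{(.000065)}{.000160}\,\mathit{sales} + \underset{(.0343)}{.0701}\,\mathit{profmarg} \]
\[ n = 30,\ R^2 = .2711,\ \bar{R}^2 = .2171. \]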


The coefficient on the dummy for the first observation is 6.47 (t = 4.58), and for the tenth observation 
it is -5.41 (t = -1.95). Notice that the coefficients on sales and profmarg are both statistically 
significant, the latter at just about the 5% level against a two-sided alternative (p-value = .051). 
Even in this regression there are still two observations with studentized residuals greater than two 
(corresponding to the two remaining observations with R&D intensities above six). 

Certain functional forms are less sensitive to outlying observations. In Section 6-2 we mentioned 
that, for most economic variables, the logarithmic transformation significantly narrows the range of 
the data and also yields functional forms—such as constant elasticity models—that can explain a 
broader range of data. 


EXAMPLE 9.9 R&D Intensity 


We can test whether R&D intensity increases with firm size by starting with the model 

\[ \mathit{rd} = \mathit{sales}^{\beta_1} \exp(\beta_0 + \beta_2\,\mathit{profmarg} + u). \qquad [9.41] \]


Then, holding other factors fixed, R&D intensity increases with sales if and only if \(\beta_1 > 1\). Taking 
the log of (9.41) gives 


\[ \log(\mathit{rd}) = \beta_0 + \beta_1 \log(\mathit{sales}) + \beta_2\,\mathit{profmarg} + u. \qquad [9.42] \]

When we use all 32 firms, the regression equation is 


\[ \widehat{\log(\mathit{rd})} = \underset{(.468)}{-4.378} + \underset{(.060)}{1.084}\,\log(\mathit{sales}) + \underset{(.0128)}{.0217}\,\mathit{profmarg}, \]
\[ n = 32,\ R^2 = .9180,\ \bar{R}^2 = .9123, \]


while dropping the largest firm gives 


\[ \widehat{\log(\mathit{rd})} = \underset{(.511)}{-4.404} + \underset{(.067)}{1.088}\,\log(\mathit{sales}) + \underset{(.0130)}{.0218}\,\mathit{profmarg}, \]
\[ n = 31,\ R^2 = .9037,\ \bar{R}^2 = .8968. \]


Practically, these results are the same. In neither case do we reject the null \(H_0: \beta_1 = 1\) against 
\(H_1: \beta_1 > 1\). (Why?) 
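
One way to check the claim is to compute the t statistic for the null that the coefficient on log(sales) equals one, using the estimates and standard errors reported above. The 5% significance level and the critical values taken from a standard t table are assumptions made for this sketch; the example itself does not state a level.

```python
def t_stat(estimate, null_value, std_error):
    """t statistic for testing that a coefficient equals null_value."""
    return (estimate - null_value) / std_error


# Estimates and standard errors on log(sales) reported in Example 9.9.
t_all = t_stat(1.084, 1.0, 0.060)   # all 32 firms, df = 32 - 3 = 29
t_drop = t_stat(1.088, 1.0, 0.067)  # largest firm dropped, df = 31 - 3 = 28

# One-sided 5% critical values from a standard t table (assumed level).
crit_29, crit_28 = 1.699, 1.701

print(round(t_all, 2), t_all > crit_29)    # 1.4 False
print(round(t_drop, 2), t_drop > crit_28)  # 1.31 False
```

Both t statistics fall short of the one-sided 5% critical value, so under that assumed level the null is not rejected in either sample.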


In some cases, certain observations are suspected at the outset of being fundamentally different 
from the rest of the sample. This often happens when we use data at very aggregated levels, such as 
the city, county, or state level. The following is an example. 


EXAMPLE 9.10 State Infant Mortality Rates 


Data on infant mortality, per capita income, and measures of health care can be obtained at the state 
level from the Statistical Abstract of the United States. We will provide a fairly simple analysis here 
just to illustrate the effect of outliers. The data are for the year 1990, and we have all 50 states in 
the United States, plus the District of Columbia (D.C.). The variable infmort is number of deaths 
within the first year per 1,000 live births, pcinc is per capita income, physic is physicians per 100,000 
members of the civilian population, and popul is the population (in thousands). The data are contained 
in INFMRT. We include all independent variables in logarithmic form: 


\[ \widehat{\mathit{infmort}} = \underset{(20.43)}{33.86} - \underset{(2.60)}{4.68}\,\log(\mathit{pcinc}) + \underset{(1.51)}{4.15}\,\log(\mathit{physic}) - \underset{(.287)}{.088}\,\log(\mathit{popul}) \qquad [9.43] \]
\[ n = 51,\ R^2 = .139,\ \bar{R}^2 = .084. \]

Higher per capita income is estimated to lower infant mortality, an expected result. But more physicians 
per capita is associated with higher infant mortality rates, something that is counterintuitive. 
Infant mortality rates do not appear to be related to population size. 
The District of Columbia is unusual in that it has pockets of extreme poverty and great wealth in 
a small area. In fact, the infant mortality rate for D.C. in 1990 was 20.7, compared with 12.4 for the 
highest state. It also has 615 physicians per 100,000 of the civilian population, compared with 337 for 
the highest state. The high number of physicians coupled with the high infant mortality rate in D.C. 
could certainly influence the results. If we drop D.C. from the regression, we obtain 
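
\[ \widehat{\mathit{infmort}} = \underset{(12.42)}{23.95} - \underset{(1.64)}{.57}\,\log(\mathit{pcinc}) - \underset{(1.19)}{2.74}\,\log(\mathit{physic}) + \underset{(.191)}{.629}\,\log(\mathit{popul}) \qquad [9.44] \]
\[ n = 50,\ R^2 = .273,\ \bar{R}^2 = .226. \]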


We now find that more physicians per capita lowers infant mortality, and the estimate is statisti- 
cally different from zero at the 5% level. The effect of per capita income has fallen sharply and is no 
longer statistically significant. In equation (9.44), infant mortality rates are higher in more populous 
states, and the relationship is very statistically significant. Also, much more variation in infmort is 
explained when D.C. is dropped from the regression. Clearly, D.C. had substantial influence on the 
initial estimates, and we would probably leave it out of any further analysis. 


As Example 9.8 demonstrates, inspecting observations in trying to determine which are outliers, and 
even which ones have substantial influence on the OLS estimates, is a difficult endeavor. More advanced 
treatments allow more formal approaches to determine which observations are likely to be influential 
observations. Using matrix algebra, Belsley, Kuh, and Welsch (1980) define the leverage of an observation, 
which formalizes the notion that an observation has a large or small influence on the OLS estimates. 
These authors also provide a more in-depth discussion of standardized and studentized residuals. 


9-6 Least Absolute Deviations Estimation 


Rather than trying to determine which observations, if any, have undue influence on the OLS esti- 
mates, a different approach to guarding against outliers is to use an estimation method that is less 
sensitive to outliers than OLS. One such method, which has become popular among applied econometricians, 
is called least absolute deviations (LAD). The LAD estimators of the \(\beta_j\) in a linear model 
minimize the sum of the absolute values of the residuals, 


\[ \min_{b_0, b_1, \ldots, b_k} \sum_{i=1}^{n} |y_i - b_0 - b_1 x_{i1} - \cdots - b_k x_{ik}|. \qquad [9.45] \]

Unlike OLS, which minimizes the sum of squared residuals, the LAD estimates are not available in 
closed form—that is, we cannot write down formulas for them. In fact, historically, solving the prob- 
lem in equation (9.45) was computationally difficult, especially with large sample sizes and many 
explanatory variables. But with the vast improvements in computational speed over the past two dec- 
ades, LAD estimates are fairly easy to obtain even for large data sets. 
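
For a very small data set, the minimization in (9.45) can be carried out by brute force. The sketch below covers only the case of an intercept and one regressor, and it relies on the linear programming result that some LAD solution passes through at least two data points, so it is enough to search over lines through pairs of points. The data and function names are made up, and this is not how statistical software computes LAD in practice.

```python
from itertools import combinations


def sum_abs_resid(b0, b1, x, y):
    """LAD objective: sum of absolute residuals for the line b0 + b1*x."""
    return sum(abs(yi - b0 - b1 * xi) for xi, yi in zip(x, y))


def lad_simple(x, y):
    """LAD fit of y on one regressor by searching lines through pairs of points."""
    best = None
    for i, j in combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue  # a vertical line is not a candidate
        b1 = (y[j] - y[i]) / (x[j] - x[i])
        b0 = y[i] - b1 * x[i]
        loss = sum_abs_resid(b0, b1, x, y)
        if best is None or loss < best[0]:
            best = (loss, b0, b1)
    return best[1], best[2]


def ols_simple(x, y):
    """Return (intercept, slope) from a simple OLS regression of y on x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sxy / sxx
    return ybar - slope * xbar, slope


# Hypothetical data: y equals x except for one outlying value of y.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
y = [1.0, 2.0, 3.0, 4.0, 5.0, 36.0, 7.0]

lad_b0, lad_b1 = lad_simple(x, y)
ols_b0, ols_b1 = ols_simple(x, y)
print(lad_b0, lad_b1)                      # 0.0 1.0
print(round(ols_b0, 2), round(ols_b1, 2))  # -4.29 3.14
```

In this example the LAD line ignores the single outlying value of y, while the OLS slope more than triples because of it. The search is quadratic in the number of observations, which is one reason it is only a teaching device.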

Figure 9.2 shows the OLS and LAD objective functions. The LAD objective function is linear on 
either side of zero, so that if, say, a positive residual increases by one unit, the LAD objective function 
increases by one unit. By contrast, the OLS objective function gives increasing importance to large 
residuals, and this makes OLS more sensitive to outlying observations. 

Because LAD does not give increasing weight to larger residuals, it is much less sensitive to 
changes in the extreme values of the data than OLS. In fact, it is known that LAD is designed to estimate 
the parameters of the conditional median of y given \(x_1, x_2, \ldots, x_k\) rather than the conditional 
mean. Because the median is not affected by large changes in the extreme observations, it follows that 
the LAD parameter estimates are more resilient to outlying observations. (See Section A-1 for a brief 
discussion of the sample median.) In choosing the estimates, OLS squares each residual, and so the 
OLS estimates can be very sensitive to outlying observations, as we saw in Examples 9.8 and 9.10. 

In addition to LAD being more computationally intensive than OLS, a second drawback of LAD 
is that all statistical inference involving the LAD estimators is justified only as the sample size grows. 
[The formulas are somewhat complicated and require matrix algebra, and we do not need them here. 
Koenker (2005) provides a comprehensive treatment.] Recall that, under the classical linear model 
assumptions, the OLS t statistics have exact t distributions, and F statistics have exact F distributions. 
While asymptotic versions of these statistics are available for LAD (and reported routinely by 
software packages that compute LAD estimates), these are justified only in large samples. Like the 
additional computational burden involved in computing LAD estimates, the lack of exact inference 
for LAD is only of minor concern, because most applications of LAD involve several hundred, if not 
several thousand, observations. Of course, we might be pushing it if we apply large-sample approximations 
in an example such as Example 9.8, with n = 32. In a sense, this is not very different from 
OLS because, more often than not, we must appeal to large sample approximations to justify OLS 
inference whenever any of the CLM assumptions fail. 


FIGURE 9.2 The OLS and LAD objective functions. 

A more subtle but important drawback to LAD is that it does not always consistently estimate 
the parameters appearing in the conditional mean function, \(E(y|x_1, \ldots, x_k)\). As mentioned earlier, 
LAD is intended to estimate the effects on the conditional median. Generally, the mean and 
median are the same only when the distribution of y given the covariates \(x_1, \ldots, x_k\) is symmetric 
about \(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k\). (Equivalently, the population error term, u, is symmetric about zero.) 
Recall that OLS produces unbiased and consistent estimators of the parameters in the conditional 
mean whether or not the error distribution is symmetric; symmetry does not appear among the Gauss- 
Markov assumptions. When LAD and OLS are applied to cases with asymmetric distributions, the 
estimated partial effect of, say, \(x_1\), obtained from LAD can be very different from the partial effect 
obtained from OLS. But such a difference could just reflect the difference between the median and 
the mean and might not have anything to do with outliers. See Computer Exercise C9 for an example. 

If we assume that the population error u in model (9.2) is independent of \((x_1, \ldots, x_k)\), then the 
OLS and LAD slope estimates should differ only by sampling error whether or not the distribution of 
u is symmetric. The intercept estimates generally will be different to reflect the fact that, if the mean 
of u is zero, then its median is different from zero under asymmetry. Unfortunately, independence 
between the error and the explanatory variables is often unrealistically strong when LAD is applied. 
In particular, independence rules out heteroskedasticity, a problem that often arises in applications 
with asymmetric distributions. 
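
The two preceding paragraphs can be illustrated with a small numerical sketch. Everything in it is an assumption made for illustration and is not taken from the text: the data are built deterministically from quantiles of an exponential distribution (so the error is right-skewed with mean zero and a negative median), the true conditional mean is 1 + x, and the helper names `ols_simple`, `lad_simple`, and `make_data` are hypothetical. The LAD fit uses a brute-force search over lines through pairs of data points, which is adequate for a tiny simple regression but is not how statistical packages compute LAD.

```python
import math

def ols_simple(x, y):
    """Closed-form OLS intercept and slope for a simple regression."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sxy / sxx
    return ybar - slope * xbar, slope

def lad_simple(x, y):
    """Brute-force LAD for a simple regression: some optimal line passes
    through two data points with distinct x values, so try every such pair
    and keep the line with the smallest sum of absolute residuals."""
    best = None
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            if x[i] == x[j]:
                continue
            b = (y[j] - y[i]) / (x[j] - x[i])
            a = y[i] - b * x[i]
            loss = sum(abs(yk - a - b * xk) for xk, yk in zip(x, y))
            if best is None or loss < best[0]:
                best = (loss, a, b)
    return best[1], best[2]

# Illustrative error draws: K evenly spaced quantiles of an Exponential(1)
# distribution, recentered so that their sample mean is exactly zero.
K = 21  # odd, so each group of errors has a unique sample median
e = [-math.log(1 - (k - 0.5) / K) for k in range(1, K + 1)]
ebar = sum(e) / K
u = [ek - ebar for ek in e]  # mean zero, right-skewed, median below zero

def make_data(scale):
    """Build y = 1 + x + scale(x) * u on a small grid of x values."""
    xs, ys = [], []
    for xv in (1.0, 2.0, 3.0, 4.0, 5.0):
        for uk in u:
            xs.append(xv)
            ys.append(1.0 + xv + scale(xv) * uk)
    return xs, ys

# Case 1: the error does not depend on x (the independence case).
x_ind, y_ind = make_data(lambda xv: 1.0)
# Case 2: the error spread grows with x (asymmetry plus heteroskedasticity).
x_het, y_het = make_data(lambda xv: xv)

ols_ind, lad_ind = ols_simple(x_ind, y_ind), lad_simple(x_ind, y_ind)
ols_het, lad_het = ols_simple(x_het, y_het), lad_simple(x_het, y_het)

print("independent error:    OLS", ols_ind, " LAD", lad_ind)
print("heteroskedastic error: OLS", ols_het, " LAD", lad_het)
```

In the first case the OLS and LAD slopes coincide and only the intercepts differ, because the median of the error is not zero. In the second case the OLS slope still recovers the slope of the conditional mean, while the LAD slope recovers the smaller slope of the conditional median. Neither estimate is wrong; they describe different features of the distribution of y given x.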

An advantage that LAD has over OLS is that, because LAD estimates the median, it is easy to 
obtain partial effects—and predictions—using monotonic transformations. Here we consider the most 
common transformation, taking the natural log. Suppose that log(y) follows a linear model where the 
error has a zero conditional median: 


$\log(y) = \beta_0 + \mathbf{x}\boldsymbol{\beta} + u$ [9.46]


$\mathrm{Med}(u|\mathbf{x}) = 0$, [9.47]

which implies that


$\mathrm{Med}[\log(y)|\mathbf{x}] = \beta_0 + \mathbf{x}\boldsymbol{\beta}$.


A well-known feature of the conditional median—see, for example, Wooldridge (2010, Chapter 12)— 
is that it passes through increasing functions. Therefore, 


$\mathrm{Med}(y|\mathbf{x}) = \exp(\beta_0 + \mathbf{x}\boldsymbol{\beta})$. [9.48]


It follows that $\beta_j$ is the semi-elasticity of Med(y|x) with respect to $x_j$. In other words, the partial effect
of $x_j$ in the linear equation (9.46) can be used to uncover the partial effect in the nonlinear model
(9.48). It is important to understand that this holds for any distribution of u such that (9.47) holds, and 
we need not assume u and x are independent. By contrast, if we specify a linear model for E[log(y) |x] 
then, in general, there is no way to uncover E(y|x). If we make a full distributional assumption for 
u given x then, in principle, we can recover E(y|x). We covered the special case in equation (6.40) 
under the assumption that log(y) follows a classical linear model. However, in general there is no 
way to find E(y|x) from a model for E[log(y)|x], even though we can always obtain Med(y|x) from 
Med[log(y)|x]. Problem 9 investigates how heteroskedasticity in a linear model for log(y) confounds 
our ability to find E(y|x). 
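
A short simulation can make the contrast between the median and the mean concrete. This is only a sketch under assumptions that are not in the text: there are no regressors, the intercept is set to 0.5, and the error is an exponential draw (rate 3) shifted so that its population median is zero, which makes it skewed with a nonzero mean.

```python
import math
import random
import statistics

random.seed(0)
n = 10001  # odd, so the sample median is a single observation

# Assumed error: e - ln(2)/3 with e ~ Exponential(rate=3).
# It is right-skewed and has population median zero.
u = [random.expovariate(3.0) - math.log(2) / 3 for _ in range(n)]

log_y = [0.5 + ui for ui in u]     # log(y) = beta0 + u, with beta0 = 0.5 assumed
y = [math.exp(v) for v in log_y]   # y itself

# The median passes through the increasing function exp().
med_from_log = math.exp(statistics.median(log_y))
med_direct = statistics.median(y)

# The mean does not: exp(mean of log y) understates the mean of y.
mean_from_log = math.exp(sum(log_y) / n)
mean_direct = sum(y) / n

print("median of y:", med_direct, " exp(median of log y):", med_from_log)
print("mean of y:  ", mean_direct, " exp(mean of log y):  ", mean_from_log)
```

The two median calculations agree, which is the sample analog of going from Med[log(y)|x] to (9.48). The two mean calculations do not agree, and closing that gap requires knowing more about the distribution of u than its mean.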

LAD is a special case of what is often called robust regression. Unfortunately, the way “robust” 
is used here can be confusing. In the statistics literature, a robust regression estimator is relatively 
insensitive to extreme observations. Effectively, observations with large residuals are given less 
weight than in least squares. [Berk (1990) contains an introductory treatment of estimators that are 
robust to outlying observations.] Based on our earlier discussion, in econometric parlance, LAD is not
a robust estimator of the conditional mean because it requires extra assumptions in order to consistently
estimate the conditional mean parameters. In equation (9.2), either the distribution of u given
$(x_1, \ldots, x_k)$ has to be symmetric about zero, or u must be independent of $(x_1, \ldots, x_k)$. Neither of
these is required for OLS.
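
The statistical sense of "robust" is easiest to see in a regression on a constant only, where OLS returns the sample mean and LAD returns the sample median. The numbers below are made up for illustration and do not come from any data set used in the text.

```python
import statistics

# Hypothetical outcomes clustered around 10.
y = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 10.3, 10.1, 9.9, 10.0]

# Same data with the last value replaced by one extreme observation.
y_out = y[:-1] + [100.0]

# Intercept-only OLS is the mean; intercept-only LAD is the median.
ols_clean, ols_out = statistics.mean(y), statistics.mean(y_out)
lad_clean, lad_out = statistics.median(y), statistics.median(y_out)

print("OLS (mean):   clean", ols_clean, " with outlier", ols_out)
print("LAD (median): clean", lad_clean, " with outlier", lad_out)
```

A single extreme value moves the OLS estimate a long way while leaving the LAD estimate where it was. This is insensitivity to outliers, which is a different property from consistency for the conditional mean.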

LAD is also a special case of quantile regression, which is used to estimate the effect of the $x_j$ on
different parts of the distribution—not just the median (or mean). For example, in a study to see how 
having access to a particular pension plan affects wealth, it could be that access affects high-wealth 
people differently from low-wealth people, and these effects both differ from the median person. 
Wooldridge (2010, Chapter 12) contains a treatment and examples of quantile regression. 
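
One way to see why LAD is a special case is through the loss function that quantile regression minimizes. The text does not give this function, so the sketch below relies on a standard fact: the quantile at level tau minimizes an asymmetric absolute loss that puts weight tau on positive residuals and 1 - tau on negative ones, and tau = 0.5 gives (half of) the absolute loss used by LAD. The function names and the small "wealth" sample are hypothetical, and only a constant is fit, so the result is a sample quantile.

```python
def check_loss(resid, tau):
    """Asymmetric absolute loss: weight tau on positive residuals
    and (1 - tau) on negative residuals."""
    return tau * resid if resid >= 0 else (tau - 1.0) * resid

def sample_quantile_by_loss(y, tau):
    """Return the data point that minimizes the total check loss.
    Searching over data points is enough because a minimizer always
    exists among them."""
    return min(sorted(y), key=lambda q: sum(check_loss(v - q, tau) for v in y))

# Hypothetical wealth figures with one very wealthy household.
wealth = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 50]

q50 = sample_quantile_by_loss(wealth, 0.5)  # the median, as in LAD
q90 = sample_quantile_by_loss(wealth, 0.9)  # a point in the upper tail

print("tau = 0.5:", q50)
print("tau = 0.9:", q90)
```

Replacing the constant with a linear function of the explanatory variables, and minimizing the same loss over the coefficients, gives quantile regression at level tau.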


Summary 


We have further investigated some important specification and data issues that often arise in empirical 
cross-sectional analysis. Misspecified functional form makes the estimated equation difficult to interpret. 
Nevertheless, incorrect functional form can be detected by adding quadratics, computing RESET, or testing 
against a nonnested alternative model using the Davidson-MacKinnon test. No additional data collection 
is needed. 

Solving the omitted variables problem is more difficult. In Section 9-2, we discussed a possible solu- 
tion based on using a proxy variable for the omitted variable. Under reasonable assumptions, including 
the proxy variable in an OLS regression eliminates, or at least reduces, bias. The hurdle in applying this 
method is that proxy variables can be difficult to find. A general possibility is to use data on a dependent 
variable from a prior year. 

Applied economists are often concerned with measurement error. Under the classical errors-in- 
variables (CEV) assumptions, measurement error in the dependent variable has no effect on the statistical 
properties of OLS. In contrast, under the CEV assumptions for an independent variable, the OLS estimator
for the coefficient on the mismeasured variable is biased toward zero. The bias in coefficients on the other
variables can go either way and is difficult to determine. 

Nonrandom samples from an underlying population can lead to biases in OLS. When sample selection 
is correlated with the error term u, OLS is generally biased and inconsistent. On the other hand, exogenous 
sample selection—which is either based on the explanatory variables or is otherwise independent of u— 
does not cause problems for OLS. Outliers in data sets can have large impacts on the OLS estimates, espe- 
cially in small samples. It is important to at least informally identify outliers and to reestimate models with 
the suspected outliers excluded. 

Least absolute deviations estimation is an alternative to OLS that is less sensitive to outliers and that 
delivers consistent estimates of conditional median parameters. In the past 20 years, with computational 
advances and improved understanding of the pros and cons of LAD and OLS, LAD is used more and more 
in empirical research—often as a supplement to OLS. 


Key Terms


Attenuation Bias
Average Marginal Effect (AME)
Average Partial Effect (APE)
Classical Errors-in-Variables (CEV)
Complete Cases Estimator
Conditional Median
Davidson-MacKinnon Test
Endogenous Explanatory Variable
Endogenous Sample Selection
Exogenous Sample Selection
Functional Form Misspecification
Influential Observations
Lagged Dependent Variable
Least Absolute Deviations (LAD)
Measurement Error
Missing at Random (MAR)
Missing Completely at Random (MCAR)
Missing Data
Missing Indicator Method (MIM)
Multiplicative Measurement Error
Nonnested Models
Nonrandom Sample
Outliers
Plug-In Solution to the Omitted Variables Problem
Proxy Variable
Random Coefficient (Slope) Model
Regression Specification Error Test (RESET)
Stratified Sampling
Studentized Residuals


Problems


1 In Problem 11 in Chapter 4, the R-squared from estimating the model


$\log(salary) = \beta_0 + \beta_1 \log(sales) + \beta_2 \log(mktval) + \beta_3 profmarg + \beta_4 ceoten + \beta_5 comten + u$,


using the data in CEOSAL2, was $R^2 = .353$ (n = 177). When $ceoten^2$ and $comten^2$ are added,
$R^2 = .375$. Is there evidence of functional form misspecification in this model?


2 Let us modify Computer Exercise C4 in Chapter 8 by using voting outcomes in 1990 for incumbents
who were elected in 1988. Candidate A was elected in 1988 and was seeking reelection in 1990;
voteA90 is Candidate A’s share of the two-party vote in 1990. The 1988 voting share of Candidate A is
used as a proxy variable for quality of the candidate. All other variables are for the 1990 election. The
following equations were estimated, using the data in VOTE2:


$\widehat{voteA90} = \underset{(9.25)}{75.71} + \underset{(.046)}{.312}\, prtystrA + \underset{(1.01)}{4.93}\, democA - \underset{(.684)}{.929}\, \log(expendA) - \underset{(.281)}{1.950}\, \log(expendB)$

$n = 186,\ R^2 = .495,\ \bar{R}^2 = .483$,

and

$\widehat{voteA90} = \underset{(10.01)}{70.81} + \underset{(.052)}{.282}\, prtystrA + \underset{(1.06)}{4.52}\, democA - \underset{(.687)}{.839}\, \log(expendA) - \underset{(.292)}{1.846}\, \log(expendB) + \underset{(.053)}{.067}\, voteA88$

$n = 186,\ R^2 = .499,\ \bar{R}^2 = .485$.


(i) Interpret the coefficient on voteA88 and discuss its statistical significance.
(ii) Does adding voteA88 have much effect on the other coefficients?


3 Let math10 denote the percentage of students at a Michigan high school receiving a passing score on a
standardized math test (see also Example 4.2). We are interested in estimating the effect of per-student
spending on math performance. A simple model is


$math10 = \beta_0 + \beta_1 \log(expend) + \beta_2 \log(enroll) + \beta_3 poverty + u$,


where poverty is the percentage of students living in poverty.

(i) The variable lnchprg is the percentage of students eligible for the federally funded school lunch
program. Why is this a sensible proxy variable for poverty?

(ii) The table that follows contains OLS estimates, with and without lnchprg as an explanatory variable.


Dependent Variable: math10

Independent Variables        (1)        (2)
log(expend)                  ele       7.75
                          (3.30)     (3.04)
log(enroll)                 .022      −1.26
                          (.615)      (.58)
lnchprg                       --      −.324
                                     (.036)
intercept                 −69.24     −23.14
                         (26.72)    (24.99)
Observations                 428        428
R-squared                  .0297      .1893


Explain why the effect of expenditures on math10 is lower in column (2) than in column (1). Is
the effect in column (2) still statistically greater than zero?
(iii) Does it appear that pass rates are lower at larger schools, other factors being equal? Explain.
(iv) Interpret the coefficient on lnchprg in column (2).
(v) What do you make of the substantial increase in $R^2$ from column (1) to column (2)?


4 The following equation explains weekly hours of television viewing by a child in terms of the child’s 
age, mother’s education, father’s education, and number of siblings: 


$tvhours^* = \beta_0 + \beta_1 age + \beta_2 age^2 + \beta_3 motheduc + \beta_4 fatheduc + \beta_5 sibs + u$.


We are worried that tvhours* is measured with error in our survey. Let tvhours denote the reported 
hours of television viewing per week. 

(i) What do the classical errors-in-variables (CEV) assumptions require in this application?

(ii) Do you think the CEV assumptions are likely to hold? Explain.


5 In Example 4.4, we estimated a model relating number of campus crimes to student enrollment for
a sample of colleges. The sample we used was not a random sample of colleges in the United States,
because many schools in 1992 did not report campus crimes. Do you think that college failure to report
crimes can be viewed as exogenous sample selection? Explain.


6 In the model (9.17), show that OLS consistently estimates $\alpha$ and $\beta$ if $a_i$ is uncorrelated with $x_i$ and $b_i$
is uncorrelated with $x_i$ and $x_i^2$, which are weaker assumptions than (9.19). [Hint: Write the equation as
in (9.18) and recall from Chapter 5 that sufficient for consistency of OLS for the intercept and slope is
$E(u_i) = 0$ and $\mathrm{Cov}(x_i, u_i) = 0$.]


7 Consider the simple regression model with classical measurement error, $y = \beta_0 + \beta_1 x^* + u$, where
we have m measures on $x^*$. Write these as $z_h = x^* + e_h$, $h = 1, \ldots, m$. Assume that $x^*$ is uncorrelated
with $u, e_1, \ldots, e_m$, that the measurement errors are pairwise uncorrelated, and have the same variance,
$\sigma_e^2$. Let $w = (z_1 + \cdots + z_m)/m$ be the average of the measures on $x^*$, so that, for each observation
i, $w_i = (z_{i1} + \cdots + z_{im})/m$ is the average of the m measures. Let $\bar{\beta}_1$ be the OLS estimator from the
simple regression $y_i$ on 1, $w_i$, $i = 1, \ldots, n$, using a random sample of data.

(i) Show that


$\mathrm{plim}(\bar{\beta}_1) = \beta_1 \left\{ \dfrac{\sigma_{x^*}^2}{[\sigma_{x^*}^2 + (\sigma_e^2/m)]} \right\}$.


[Hint: The plim of $\bar{\beta}_1$ is $\mathrm{Cov}(w, y)/\mathrm{Var}(w)$.]
(ii) How does the inconsistency in $\bar{\beta}_1$ compare with that when only a single measure is available
(that is, m = 1)? What happens as m grows? Comment.


8 The point of this exercise is to show that tests for functional form cannot be relied on as a general test
for omitted variables. Suppose that, conditional on the explanatory variables $x_1$ and $x_2$, a linear model
relating y to $x_1$ and $x_2$ satisfies the Gauss-Markov assumptions:


$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + u$
$E(u|x_1, x_2) = 0$
$\mathrm{Var}(u|x_1, x_2) = \sigma^2$.

To make the question interesting, assume $\beta_2 \neq 0$.
Suppose further that $x_2$ has a simple linear relationship with $x_1$:

$x_2 = \delta_0 + \delta_1 x_1 + r$
$E(r|x_1) = 0$
$\mathrm{Var}(r|x_1) = \tau^2$.


(i) Show that


$E(y|x_1) = (\beta_0 + \beta_2 \delta_0) + (\beta_1 + \beta_2 \delta_1) x_1$.


Under random sampling, what is the probability limit of the OLS estimator from the simple regression
of y on $x_1$? Is the simple regression estimator generally consistent for $\beta_1$?

(ii) If you run the regression of y on $x_1$, $x_1^2$, what will be the probability limit of the OLS estimator
of the coefficient on $x_1^2$? Explain.

(iii) Using substitution, show that we can write


$y = (\beta_0 + \beta_2 \delta_0) + (\beta_1 + \beta_2 \delta_1) x_1 + u + \beta_2 r$.


It can be shown that, if we define $v = u + \beta_2 r$, then $E(v|x_1) = 0$ and $\mathrm{Var}(v|x_1) = \sigma^2 + \beta_2^2 \tau^2$. What
consequences does this have for the t statistic on $x_1^2$ from the regression in part (ii)?

(iv) What do you conclude about adding a nonlinear function of $x_1$ (in particular, $x_1^2$) in an attempt
to detect omission of $x_2$?


9 Suppose that log(y) follows a linear model with a linear form of heteroskedasticity. We write this as


$\log(y) = \beta_0 + \mathbf{x}\boldsymbol{\beta} + u$
$u|\mathbf{x} \sim \mathrm{Normal}[0, h(\mathbf{x})]$,


so that, conditional on x, u has a normal distribution with mean (and median) zero but with variance
h(x) that depends on x. Because $\mathrm{Med}(u|\mathbf{x}) = 0$, equation (9.48) holds: $\mathrm{Med}(y|\mathbf{x}) = \exp(\beta_0 + \mathbf{x}\boldsymbol{\beta})$.
Further, using an extension of the result from Chapter 6, it can be shown that


$E(y|\mathbf{x}) = \exp[\beta_0 + \mathbf{x}\boldsymbol{\beta} + h(\mathbf{x})/2]$.


(i) Given that h(x) can be any positive function, is it possible to conclude $\partial E(y|\mathbf{x})/\partial x_j$ is the same
sign as $\beta_j$?

(ii) Suppose $h(\mathbf{x}) = \delta_0 + \mathbf{x}\boldsymbol{\delta}$ (and ignore the problem that linear functions are not necessarily
always positive). Show that a particular variable, say $x_1$, can have a negative effect on Med(y|x)
but a positive effect on E(y|x).

(iii) Consider the case covered in Section 6-4, in which $h(\mathbf{x}) = \sigma^2$. How would you predict y
using an estimate of E(y|x)? How would you predict y using an estimate of Med(y|x)? Which
prediction is always larger?


10 This exercise shows that in a simple regression model, adding a dummy variable for missing data on
the explanatory variable produces a consistent estimator of the slope coefficient if the “missingness”
is unrelated to both the unobservable and observable factors affecting y. Let m be a variable such that
m = 1 if we do not observe x and m = 0 if we observe x. We assume that y is always observed. The
population model is


$y = \beta_0 + \beta_1 x + u$
$E(u|x) = 0$.

(i) Provide an interpretation of the stronger assumption

$E(u|x, m) = 0$.

In particular, what kind of missing data schemes would cause this assumption to fail?

(ii) Show that we can always write


$y = \beta_0 + \beta_1 (1 - m)x + \beta_1 mx + u$.


(iii) Let $\{(x_i, y_i, m_i): i = 1, \ldots, n\}$ be random draws from the population, where $x_i$ is missing when
$m_i = 1$. Explain the nature of the variable $z_i = (1 - m_i)x_i$. In particular, what does this variable
equal when $x_i$ is missing?

(iv) Let $\rho = P(m = 1)$ and assume that m and x are independent. Show that


$\mathrm{Cov}[(1 - m)x, mx] = -\rho(1 - \rho)\mu_x^2$,


where $\mu_x = E(x)$. What does this imply about estimating $\beta_1$ from the regression $y_i$ on
$z_i$, $i = 1, \ldots, n$?

(v) If m and x are independent, it can be shown that


$mx = \delta_0 + \delta_1 m + v$,


where v is uncorrelated with m and $z = (1 - m)x$. Explain why this makes m a suitable proxy
variable for mx. What does this mean about the coefficient on $z_i$ in the regression


$y_i$ on $z_i$, $m_i$, $i = 1, \ldots, n$?


(vi) Suppose for a population of children, y is a standardized test score, obtained from school
records, and x is family income, which is reported voluntarily by families (and so some families
do not report their income). Is it realistic to assume m and x are independent? Explain.


11 (i) In column (3) of Table 9.2, the coefficient on educ is .018 and it is statistically insignificant,
and that on IQ is actually negative, −.0009, and also statistically insignificant. Explain what is
happening.

(ii) What regression might you run that still includes an interaction to make the coefficients on educ
and IQ more sensible? Explain.


Computer Exercises 


C1 (i) Apply RESET from equation (9.3) to the model estimated in Computer Exercise C5 in
Chapter 7. Is there evidence of functional form misspecification in the equation?

(ii) Compute a heteroskedasticity-robust form of RESET. Does your conclusion from part (i)
change?


C2 Use the data set WAGE2 for this exercise.

(i) Use the variable KWW (the “knowledge of the world of work” test score) as a proxy for ability
in place of IQ in Example 9.3. What is the estimated return to education in this case?

(ii) Now, use IQ and KWW together as proxy variables. What happens to the estimated return to
education?

(iii) In part (ii), are IQ and KWW individually significant? Are they jointly significant?


C3 Use the data from JTRAIN for this exercise.

(i) Consider the simple regression model

$\log(scrap) = \beta_0 + \beta_1 grant + u$,

where scrap is the firm scrap rate and grant is a dummy variable indicating whether a firm
received a job training grant. Can you think of some reasons why the unobserved factors in u
might be correlated with grant?

(ii) Estimate the simple regression model using the data for 1988. (You should have 54
observations.) Does receiving a job training grant significantly lower a firm’s scrap rate?

(iii) Now, add as an explanatory variable $\log(scrap_{87})$. How does this change the estimated effect of
grant? Interpret the coefficient on grant. Is it statistically significant at the 5% level against the
one-sided alternative $H_1\colon \beta_{grant} < 0$?

(iv) Test the null hypothesis that the parameter on $\log(scrap_{87})$ is one against the two-sided
alternative. Report the p-value for the test.

(v) Repeat parts (iii) and (iv), using heteroskedasticity-robust standard errors, and briefly discuss
any notable differences.


C4 Use the data for the year 1990 in INFMRT for this exercise.

(i) Reestimate equation (9.43), but now include a dummy variable for the observation on the
District of Columbia (called DC). Interpret the coefficient on DC and comment on its size and
significance.

(ii) Compare the estimates and standard errors from part (i) with those from equation (9.44). What
do you conclude about including a dummy variable for a single observation?


C5 Use the data in RDCHEM to further examine the effects of outliers on OLS estimates and to see how
LAD is less sensitive to outliers. The model is


$rdintens = \beta_0 + \beta_1 sales + \beta_2 sales^2 + \beta_3 profmarg + u$,


where you should first change sales to be in billions of dollars to make the estimates easier to
interpret.

(i) Estimate the above equation by OLS, both with and without the firm having annual sales of
almost $40 billion. Discuss any notable differences in the estimated coefficients.

(ii) Estimate the same equation by LAD, again with and without the largest firm. Discuss any
important differences in estimated coefficients.

(iii) Based on your findings in parts (i) and (ii), would you say OLS or LAD is more resilient to outliers?


C6 Redo Example 4.10 by dropping schools where teacher benefits are less than 1% of salary.
(i) How many observations are lost?
(ii) Does dropping these observations have any important effects on the estimated tradeoff?


C7 Use the data in LOANAPP for this exercise.

(i) How many observations have obrat > 40, that is, other debt obligations more than 40% of total
income?

(ii) Reestimate the model in part (iii) of Computer Exercise C8 in Chapter 7, excluding
observations with obrat > 40. What happens to the estimate and t statistic on white?

(iii) Does it appear that the estimate of $\beta_{white}$ is overly sensitive to the sample used?


C8 Use the data in TWOYEAR for this exercise.

(i) The variable stotal is a standardized test variable, which can act as a proxy variable for
unobserved ability. Find the sample mean and standard deviation of stotal.

(ii) Run simple regressions of jc and univ on stotal. Are both college education variables
statistically related to stotal? Explain.

(iii) Add stotal to equation (4.17) and test the hypothesis that the returns to two- and four-year
colleges are the same against the alternative that the return to four-year colleges is greater. How
do your findings compare with those from Section 4-4?

(iv) Add $stotal^2$ to the equation estimated in part (iii). Does a quadratic in the test score variable
seem necessary?

(v) Add the interaction terms $stotal \cdot jc$ and $stotal \cdot univ$ to the equation from part (iii). Are these terms
jointly significant?

(vi) What would be your final model that controls for ability through the use of stotal? Justify your
answer.


C9 In this exercise, you are to compare OLS and LAD estimates of the effects of 401(k) plan eligibility on
net financial assets. The model is


$nettfa = \beta_0 + \beta_1 inc + \beta_2 inc^2 + \beta_3 age + \beta_4 age^2 + \beta_5 male + \beta_6 e401k + u$.


(i) Use the data in 401KSUBS to estimate the equation by OLS and report the results in the usual
form. Interpret the coefficient on e401k.

(ii) Use the OLS residuals to test for heteroskedasticity using the Breusch-Pagan test. Is u
independent of the explanatory variables?

(iii) Estimate the equation by LAD and report the results in the same form as for OLS. Interpret the
LAD estimate of $\beta_6$.

(iv) Reconcile your findings from parts (i) and (iii).


C10 You need to use two data sets for this exercise, JTRAIN2 and JTRAIN3. The former is the outcome
of a job training experiment. The file JTRAIN3 contains observational data, where individuals themselves
largely determine whether they participate in job training. The data sets cover the same time
period.

(i) In the data set JTRAIN2, what fraction of the men received job training? What is the fraction in
JTRAIN3? Why do you think there is such a big difference?

(ii) Using JTRAIN2, run a simple regression of re78 on train. What is the estimated effect of
participating in job training on real earnings?

(iii) Now add as controls to the regression in part (ii) the variables re74, re75, educ, age, black,
and hisp. Does the estimated effect of job training on re78 change much? How come? (Hint:
Remember that these are experimental data.)

(iv) Do the regressions in parts (ii) and (iii) using the data in JTRAIN3, reporting only the estimated
coefficients on train, along with their t statistics. What is the effect now of controlling for the
extra factors, and why?

(v) Define avgre = (re74 + re75)/2. Find the sample averages, standard deviations, and minimum
and maximum values in the two data sets. Are these data sets representative of the same
populations in 1978?

(vi) Almost 96% of men in the data set JTRAIN2 have avgre less than $10,000. Using only these
men, run the regression


re78 on train, re74, re75, educ, age, black, hisp


and report the training estimate and its t statistic. Run the same regression for JTRAIN3, using
only men with $avgre \le 10$. For the subsample of low-income men, how do the estimated
training effects compare across the experimental and nonexperimental data sets?

(vii) Now use each data set to run the simple regression re78 on train, but only for men who were
unemployed in 1974 and 1975. How do the training estimates compare now?

(viii) Using your findings from the previous regressions, discuss the potential importance of having
comparable populations underlying comparisons of experimental and nonexperimental estimates.


C11 Use the data in MURDER only for the year 1993 for this question, although you will need to first
obtain the lagged murder rate, say $mrdrte_{-1}$.

(i) Run the regression of mrdrte on exec, unem. What are the coefficient and t statistic on exec?
Does this regression provide any evidence for a deterrent effect of capital punishment?

(ii) How many executions are reported for Texas during 1993? (Actually, this is the sum of
executions for the current and past two years.) How does this compare with the other states?
Add a dummy variable for Texas to the regression in part (i). Is its t statistic unusually large?
From this, does it appear Texas is an “outlier”?

(iii) To the regression in part (i) add the lagged murder rate. What happens to $\hat{\beta}_{exec}$ and its statistical
significance?

(iv) For the regression in part (iii), does it appear Texas is an outlier? What is the effect on $\hat{\beta}_{exec}$ from
dropping Texas from the regression?


C12 Use the data in ELEM94_95 to answer this question. See also Computer Exercise C10 in Chapter 4.

(i) Using all of the data, run the regression lavgsal on bs, lenrol, lstaff, and lunch. Report the
coefficient on bs along with its usual and heteroskedasticity-robust standard errors. What do you
conclude about the economic and statistical significance of $\hat{\beta}_{bs}$?

(ii) Now drop the four observations with bs > .5, that is, where average benefits are (supposedly)
more than 50% of average salary. What is the coefficient on bs? Is it statistically significant
using the heteroskedasticity-robust standard error?

(iii) Verify that the four observations with bs > .5 are 68, 1,127, 1,508, and 1,670. Define four
dummy variables for each of these observations. (You might call them d68, d1127, d1508,
and d1670.) Add these to the regression from part (i) and verify that the OLS coefficients
and standard errors on the other variables are identical to those in part (ii). Which of the four
dummies has a t statistic statistically different from zero at the 5% level?

(iv) Verify that, in this data set, the data point with the largest studentized residual (largest t statistic
on the dummy variable) in part (iii) has a large influence on the OLS estimates. (That is, run
OLS using all observations except the one with the large studentized residual.) Does dropping,
in turn, each of the other observations with bs > .5 have important effects?

(v) What do you conclude about the sensitivity of OLS to a single observation, even with a large
sample size?

(vi) Verify that the LAD estimator is not sensitive to the inclusion of the observation identified in
part (iii).


C13 Use the data in CEOSAL2 to answer this question.

(i) Estimate the model

$lsalary = \beta_0 + \beta_1 lsales + \beta_2 lmktval + \beta_3 ceoten + \beta_4 ceoten^2 + u$

by OLS using all of the observations, where lsalary, lsales, and lmktval are all natural
logarithms. Report the results in the usual form with the usual OLS standard errors. (You may
verify that the heteroskedasticity-robust standard errors are similar.)

(ii) In the regression from part (i) obtain the studentized residuals; call these $str_i$. How many
studentized residuals are above 1.96 in absolute value? If the studentized residuals were
independent draws from a standard normal distribution, about how many would you expect to
be above two in absolute value with 177 draws?

(iii) Reestimate the equation in part (i) by OLS using only the observations with $|str_i| \le 1.96$. How
do the coefficients compare with those in part (i)?

(iv) Estimate the equation in part (i) by LAD, using all of the data. Is the estimate of $\beta_1$ closer to the
OLS estimate using the full sample or the restricted sample? What about for $\beta_3$?

(v) Evaluate the following statement: “Dropping outliers based on extreme values of studentized
residuals makes the resulting OLS estimates closer to the LAD estimates on the full sample.”


C14 Use the data in ECONMATH to answer this question. The population model is


$score = \beta_0 + \beta_1 act + u$.


(i) For how many students is the ACT score missing? What is the fraction of the sample?
Define a new variable, actmiss, which equals one if act is missing, and zero otherwise.

(ii) Create a new variable, say act0, which is the act score when act is reported and zero when act is
missing. Find the average of act0 and compare it with the average for act.

(iii) Run the simple regression of score on act using only the complete cases. What do you obtain
for the slope coefficient and its heteroskedasticity-robust standard error?

(iv) Run the simple regression of score on act0 using all of the cases. Compare the slope coefficient
with that in part (iii) and comment.

(v) Now use all of the cases and run the regression


$score_i$ on $act0_i$, $actmiss_i$.


What is the slope estimate on $act0_i$? How does it compare with the answers in parts (iii)
and (iv)?

(vi) Comparing regressions in parts (iii) and (v), does using all cases and adding the missing data
estimator improve estimation of $\beta_1$?

(vii) If you add the variable colgpa to the regressions in parts (iii) and (v), does this change your
answer to part (vi)?


Now that we have a solid understanding of how to use the multiple regression model for
cross-sectional applications, we can turn to the econometric analysis of time series data.
Because we will rely heavily on the method of ordinary least squares, most of the work 
concerning mechanics and inference has already been done. However, as we noted in Chapter 1, 
time series data have certain characteristics that cross-sectional data do not, and these can require 
special attention when applying OLS. 

Chapter 10 covers basic regression analysis and gives attention to problems unique to time 
series data. We provide a set of Gauss-Markov and classical linear model assumptions for time 
series applications. The problems of functional form, dummy variables, trends, and seasonality are 
also discussed. 

Because certain time series models necessarily violate the Gauss-Markov assumptions, 
Chapter 11 describes the nature of these violations and presents the large sample properties of 
ordinary least squares. As we can no longer assume random sampling, we must cover conditions
that restrict the temporal correlation in a time series in order to ensure that the usual asymptotic 
analysis is valid. 

Chapter 12 turns to an important new problem: serial correlation in the error terms in time series 
regressions. We discuss the consequences, ways of testing, and methods for dealing with serial 
correlation. Chapter 12 also contains an explanation of how heteroskedasticity can arise in time 
series models. 


CHAPTER 10


Basic Regression Analysis with Time Series Data


In this chapter, we begin to study the properties of OLS for estimating linear regression models
using time series data. In Section 10-1, we discuss some conceptual differences between time series
and cross-sectional data. Section 10-2 provides some examples of time series regressions that are 
often estimated in the empirical social sciences. We then turn our attention to the finite sample prop- 
erties of the OLS estimators and state the Gauss-Markov assumptions and the classical linear model 
assumptions for time series regression. Although these assumptions have features in common with 
those for the cross-sectional case, they also have some significant differences that we will need to 
highlight. 

In addition, we return to some issues that we treated in regression with cross-sectional data, such 
as how to use and interpret the logarithmic functional form and dummy variables. The important top- 
ics of how to incorporate trends and account for seasonality in multiple regression are taken up in 
Section 10-5. 


10-1 The Nature of Time Series Data 




An obvious characteristic of time series data that distinguishes them from cross-sectional data is tem- 
poral ordering. For example, in Chapter 1, we briefly discussed a time series data set on employment, 
the minimum wage, and other economic variables for Puerto Rico. In this data set, we must know that 
the data for 1970 immediately precede the data for 1971. For analyzing time series data in the social 
sciences, we must recognize that the past can affect the future, but not vice versa (unlike in the Star 
Trek universe). To emphasize the proper ordering of time series data, Table 10.1 gives a partial listing
of the data on U.S. inflation and unemployment rates from various editions of the Economic Report of
the President, including the 2018 Report (Tables B-10 and B-11).


TABLE 10.1 Partial Listing of Data on U.S. Inflation and Unemployment Rates, 1948-2017

Year    Inflation    Unemployment
1948       8.1           3.8
1949      −1.2           5.9
1950       1.3           5.3
1951       7.9           3.3
2012       2.1           8.1
2013       iE            7.4
2014       1.6           6.2
2015       0.1           5.3
2016       1.3           4.9
2017       2.1           4.4

Another difference between cross-sectional and time series data is more subtle. In Chapters 3 
and 4, we studied statistical properties of the OLS estimators based on the notion that samples were 
randomly drawn from the appropriate population. Understanding why cross-sectional data should be 
viewed as random outcomes is fairly straightforward: a different sample drawn from the population 
will generally yield different values of the independent and dependent variables (such as education, 
experience, wage, and so on). Therefore, the OLS estimates computed from different random samples 
will generally differ, and this is why we consider the OLS estimators to be random variables. 

How should we think about randomness in time series data? Certainly, economic time series 
satisfy the intuitive requirements for being outcomes of random variables. For example, today we do 
not know what the Dow Jones Industrial Average will be at the close of the next trading day. We do 
not know what the annual growth in output will be in Canada during the coming year. Because the 
outcomes of these variables are not foreknown, they should clearly be viewed as random variables. 

Formally, a sequence of random variables indexed by time is called a stochastic process or a 
time series process. (“Stochastic” is a synonym for random.) When we collect a time series data set,
we obtain one possible outcome, or realization, of the stochastic process. We can only see a single 
realization because we cannot go back in time and start the process over again. (This is analogous to 
cross-sectional analysis where we can collect only one random sample.) However, if certain condi- 
tions in history had been different, we would generally obtain a different realization for the stochastic 
process, and this is why we think of time series data as the outcome of random variables. The set of 
all possible realizations of a time series process plays the role of the population in cross-sectional 
analysis. The sample size for a time series data set is the number of time periods over which we 
observe the variables of interest. 
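
To make the idea of a realization concrete, the following minimal sketch simulates a hypothetical stochastic process twice. The Gaussian random walk, the seeds, and the function name `simulate_random_walk` are assumptions chosen purely for illustration; they are not a model taken from the text. Each call produces one possible "history" of length n, and in practice we would observe only one of them.

```python
import numpy as np

def simulate_random_walk(n, seed):
    """Return one realization of a Gaussian random walk of length n."""
    rng = np.random.default_rng(seed)
    shocks = rng.normal(size=n)   # i.i.d. N(0, 1) innovations
    return np.cumsum(shocks)      # y_t = y_{t-1} + e_t, starting from zero

n = 50                                     # number of time periods (the sample size)
path_a = simulate_random_walk(n, seed=1)   # the history we happen to observe
path_b = simulate_random_walk(n, seed=2)   # a history that could have occurred instead

print(path_a[:5])
print(path_b[:5])
```

The two paths come from the same process but differ in every period, which is the sense in which a time series data set is a single draw from a population of possible realizations.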


10-2 Examples of Time Series Regression Models 


In this section, we discuss two examples of time series models that have been useful in empirical time 
series analysis and that are easily estimated by ordinary least squares. We will study additional models
in Chapter 11.


10-2a Static Models


Suppose that we have time series data available on two variables, say y and z, where $y_t$ and $z_t$ are dated
contemporaneously. A static model relating y to z is


$y_t = \beta_0 + \beta_1 z_t + u_t, \quad t = 1, 2, \ldots, n.$ [10.1]


The name “static model” comes from the fact that we are modeling a contemporaneous relationship
between y and z. Usually, a static model is postulated when a change in z at time t is believed to have
an immediate effect on y: $\Delta y_t = \beta_1 \Delta z_t$, when $\Delta u_t = 0$. Static regression models are also used when
we are interested in knowing the tradeoff between y and z.

An example of a static model is the static Phillips curve, given by 


$inf_t = \beta_0 + \beta_1 unem_t + u_t,$ [10.2]


where $inf_t$ is the annual inflation rate and $unem_t$ is the annual unemployment rate. This form of the
Phillips curve assumes a constant natural rate of unemployment and constant inflationary expecta- 
tions, and it can be used to study the contemporaneous tradeoff between inflation and unemployment. 
[See, for example, Mankiw (1994, Section 11-2).] 

Naturally, we can have several explanatory variables in a static regression model. Let $mrdrte_t$
denote the murders per 10,000 people in a particular city during year t, let $convrte_t$ denote the murder
conviction rate, let $unem_t$ be the local unemployment rate, and let $yngmle_t$ be the fraction of the
population consisting of males between the ages of 18 and 25. Then, a static multiple regression model
explaining murder rates is


$mrdrte_t = \beta_0 + \beta_1 convrte_t + \beta_2 unem_t + \beta_3 yngmle_t + u_t.$ [10.3]


Using a model such as this, we can hope to estimate, for example, the ceteris paribus effect of an 
increase in the conviction rate on a particular criminal activity. 


10-2b Finite Distributed Lag Models 


In a finite distributed lag (FDL) model, we allow one or more variables to affect y with a lag. For 
example, for annual observations, consider the model 


$gfr_t = \alpha_0 + \delta_0 pe_t + \delta_1 pe_{t-1} + \delta_2 pe_{t-2} + u_t,$ [10.4]


where $gfr_t$ is the general fertility rate (children born per 1,000 women of childbearing age) and $pe_t$ is
the real dollar value of the personal tax exemption. The idea is to see whether, in the aggregate, the 
decision to have children is linked to the tax value of having a child. Equation (10.4) recognizes that, 
for both biological and behavioral reasons, decisions to have children would not immediately result 
from changes in the personal exemption. 

Equation (10.4) is an example of the model 


$y_t = \alpha_0 + \delta_0 z_t + \delta_1 z_{t-1} + \delta_2 z_{t-2} + u_t,$ [10.5]


which is an FDL of order two. To interpret the coefficients in (10.5), suppose that z is a constant, 
equal to c, in all time periods before time t. At time t, z increases by one unit to c + 1 and then reverts 
to its previous level at time t + 1. (That is, the increase in z is temporary.) More precisely, 


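$\ldots,\ z_{t-2} = c,\ z_{t-1} = c,\ z_t = c + 1,\ z_{t+1} = c,\ z_{t+2} = c,\ \ldots$

To focus on the ceteris paribus effect of z on y, we set the error term in each time period to
zero. Then,


$y_{t-1} = \alpha_0 + \delta_0 c + \delta_1 c + \delta_2 c,$
$y_t = \alpha_0 + \delta_0 (c + 1) + \delta_1 c + \delta_2 c,$
$y_{t+1} = \alpha_0 + \delta_0 c + \delta_1 (c + 1) + \delta_2 c,$
$y_{t+2} = \alpha_0 + \delta_0 c + \delta_1 c + \delta_2 (c + 1),$
$y_{t+3} = \alpha_0 + \delta_0 c + \delta_1 c + \delta_2 c,$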


and so on. From the first two equations, $y_t - y_{t-1} = \delta_0$, which shows that $\delta_0$ is the immediate change
in y due to the one-unit increase in z at time t. Usually, $\delta_0$ is called the impact propensity or impact
multiplier.

Similarly, $\delta_1 = y_{t+1} - y_{t-1}$ is the change in y one period after the temporary change and
$\delta_2 = y_{t+2} - y_{t-1}$ is the change in y two periods after the change. At time t + 3, y has reverted back
to its initial level: $y_{t+3} = y_{t-1}$. This is because we have assumed that only two lags of z appear in
(10.5). When we graph the $\delta_j$ as a function of j, we obtain the lag distribution, which summarizes the
dynamic effect that a temporary increase in z has on y. A possible lag distribution for the FDL of order
two is given in Figure 10.1. (Of course, we would never know the parameters $\delta_j$; instead, we will
estimate the $\delta_j$ and then plot the estimated lag distribution.)

The lag distribution in Figure 10.1 implies that the largest effect is at the first lag. The lag
distribution has a useful interpretation. If we standardize the initial value of y at $y_{t-1} = 0$, the lag
distribution traces out all subsequent values of y due to a one-unit, temporary increase in z.
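
The temporary-increase thought experiment can be traced numerically. The sketch below is only an illustration: the coefficient values, the level c, and the helper name `fdl_path` are hypothetical choices, not estimates from the text. With the errors set to zero, the change in y relative to its value one period before the increase reproduces the lag distribution and then returns to zero.

```python
import numpy as np

def fdl_path(z, alpha0, deltas):
    """y_t = alpha0 + sum_j deltas[j] * z_{t-j}, with the errors set to zero.

    The first len(deltas) - 1 entries of z serve only as initial lags,
    so the returned path is shorter than z by that amount.
    """
    q = len(deltas) - 1
    return np.array([
        alpha0 + sum(d * z[t - j] for j, d in enumerate(deltas))
        for t in range(q, len(z))
    ])

alpha0, deltas, c = 1.0, [0.2, 0.5, 0.3], 4.0   # hypothetical values
z = np.full(8, c)
z[3] = c + 1.0        # temporary one-unit increase at "time t" (index 3)

y = fdl_path(z, alpha0, deltas)   # y[0] is the value one period before the increase
response = y - y[0]               # change relative to y_{t-1}
print(response)
```

The nonzero entries of `response` are the three coefficients in order, followed by zeros once the increase has worked its way through both lags.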

We are also interested in the change in y due to a permanent increase in z. Before time t, z equals 
the constant c. At time t, z increases permanently to c + 1: $z_s = c$, $s < t$, and $z_s = c + 1$, $s \geq t$.
Again, setting the errors to zero, we have


$y_{t-1} = \alpha_0 + \delta_0 c + \delta_1 c + \delta_2 c,$
$y_t = \alpha_0 + \delta_0 (c + 1) + \delta_1 c + \delta_2 c,$
$y_{t+1} = \alpha_0 + \delta_0 (c + 1) + \delta_1 (c + 1) + \delta_2 c,$
$y_{t+2} = \alpha_0 + \delta_0 (c + 1) + \delta_1 (c + 1) + \delta_2 (c + 1),$


and so on. With the permanent increase in z, after one period, y has increased by $\delta_0 + \delta_1$, and after
two periods, y has increased by $\delta_0 + \delta_1 + \delta_2$. There are no further changes in y after two periods.
This shows that the sum of the coefficients on current and lagged z, $\delta_0 + \delta_1 + \delta_2$, is the long-run
change in y given a permanent increase in z and is called the long-run propensity (LRP) or long-run
multiplier. The LRP is often of interest in distributed lag models.


FIGURE 10.1 A lag distribution with two nonzero lags. The maximum effect is at the first lag.
[Plot of the coefficient $\delta_j$ (vertical axis) against the lag j (horizontal axis).]
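
The permanent-increase case can be checked the same way. In this sketch the coefficient values and the level c are again hypothetical. The change in y accumulates the coefficients one at a time and then stays at their sum, the LRP.

```python
import numpy as np

deltas = np.array([0.2, 0.5, 0.3])   # hypothetical delta_0, delta_1, delta_2
alpha0, c = 1.0, 4.0                 # hypothetical intercept and initial level of z

# z equals c before time t and c + 1 from time t onward (t is index 3 here).
z = np.full(8, c)
z[3:] = c + 1.0

# Error-free path of y for the periods in which both lags are available;
# the reversed slice is (z_t, z_{t-1}, z_{t-2}).
y = np.array([alpha0 + deltas @ z[t - 2:t + 1][::-1] for t in range(2, len(z))])

change = y - y[0]                # change relative to y_{t-1}
cumulative = np.cumsum(deltas)   # delta_0, delta_0 + delta_1, and the LRP
print(change)
print(cumulative[-1])
```

Here `change` rises through the partial sums stored in `cumulative` and shows no further movement after two periods, matching the algebra above.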

As an example, in equation (10.4), $\delta_0$ measures the immediate change in fertility due to a
one-dollar increase in pe. As we mentioned earlier, there are reasons to believe that $\delta_0$ is small, if not zero.
But $\delta_1$ or $\delta_2$, or both, might be positive. If pe permanently increases by one dollar, then, after two
years, gfr will have changed by $\delta_0 + \delta_1 + \delta_2$. This model assumes that there are no further changes
after two years. Whether this is actually the case is an empirical matter. 

An FDL of order q is written as 


$y_t = \alpha_0 + \delta_0 z_t + \delta_1 z_{t-1} + \cdots + \delta_q z_{t-q} + u_t.$ [10.6]


This contains the static model as a special case by setting $\delta_1, \delta_2, \ldots, \delta_q$ equal to zero. Sometimes, a
primary purpose for estimating a distributed lag model is to test whether z has a lagged effect on y.
The impact propensity is always the coefficient on the contemporaneous z, $\delta_0$. Occasionally, we omit
$z_t$ from (10.6), in which case the impact propensity is zero. In the general case, the lag distribution
can be plotted by graphing the (estimated) $\delta_j$ as a function of j. For any horizon h, we can define
the cumulative effect as $\delta_0 + \delta_1 + \cdots + \delta_h$, which is interpreted as the change in the expected
outcome h periods after a permanent, one-unit increase in z. Once the $\delta_j$ have been estimated,
one may plot the estimated cumulative effects as a function of h. The LRP is the cumulative
effect after all changes have taken place; it is simply the sum of all of the coefficients
on the $z_{t-j}$:


$LRP = \delta_0 + \delta_1 + \cdots + \delta_q.$ [10.7]


Because of the often substantial correlation in z at different lags (that is, due to multicollinearity in
(10.6)), it can be difficult to obtain precise estimates of the individual $\delta_j$. Interestingly, even when
the $\delta_j$ cannot be precisely estimated, we can often get good estimates of the LRP. We will see an example
later.

We can have more than one explanatory variable appearing with lags, or we can add contemporaneous
variables to an FDL model. For example, the average education level for women of childbearing age
could be added to (10.4), which allows us to account for changing education levels for women.


GOING FURTHER 10.1
In an equation for annual data, suppose that

$int_t = 1.6 + .48\, inf_t - .15\, inf_{t-1} + .32\, inf_{t-2} + u_t,$

where int is an interest rate and inf is the inflation rate. What are the impact and long-run propensities?
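
The claim that the LRP can often be estimated well even when the individual lag coefficients cannot may be illustrated with a small simulation. Everything in this sketch is an assumption made for illustration: the true coefficients, the persistent AR(1) process used for z, the sample size, and the seed. The code builds the lagged regressors by hand (using the first q values of z only as initial lags), estimates the FDL by OLS, and compares the usual standard errors of the individual coefficients with the standard error of their sum.

```python
import numpy as np

rng = np.random.default_rng(0)
n, q = 2000, 2
alpha0, deltas = 1.0, np.array([0.2, 0.5, 0.3])   # hypothetical true values

# A highly persistent regressor, so z_t, z_{t-1}, and z_{t-2} are strongly correlated.
z = np.zeros(n + q)
for t in range(1, n + q):
    z[t] = 0.9 * z[t - 1] + rng.normal()

# Columns: intercept, z_t, z_{t-1}, z_{t-2}. The first q values of z are used
# only as initial lags, which leaves n usable observations.
X = np.column_stack([np.ones(n)] + [z[q - j:n + q - j] for j in range(q + 1)])
y = X @ np.concatenate([[alpha0], deltas]) + rng.normal(size=n)

coef, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ coef
sigma2_hat = resid @ resid / (n - X.shape[1])     # SSR / df
V = sigma2_hat * np.linalg.inv(X.T @ X)           # estimated variance matrix

se_delta = np.sqrt(np.diag(V))[1:]                # standard errors of the delta_j
w = np.array([0.0, 1.0, 1.0, 1.0])                # weights that pick out the LRP
lrp_hat, se_lrp = w @ coef, np.sqrt(w @ V @ w)
print(se_delta, lrp_hat, se_lrp)
```

In this design the standard error of the estimated LRP is smaller than the standard error of any single lag coefficient, because the negative sampling covariances among the estimated coefficients partly cancel when they are summed.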


10-2c A Convention about the Time Index 


When models have lagged explanatory variables (and, as we will see in Chapter 11, for models with 
lagged y), confusion can arise concerning the treatment of initial observations. For example, if in 
(10.5) we assume that the equation holds starting at t = 1, then the explanatory variables for the first 
time period are $z_1$, $z_0$, and $z_{-1}$. Our convention will be that these are the initial values in our sample,
so that we can always start the time index at t = 1. In practice, this is not very important because 
regression packages automatically keep track of the observations available for estimating models with 
lags. But for this and the next two chapters, we need some convention concerning the first time period 
being represented by the regression equation.


10-3 Finite Sample Properties of OLS under Classical Assumptions


In this section, we give a complete listing of the finite sample, or small sample, properties of OLS 
under standard assumptions. We pay particular attention to how the assumptions must be altered from 
our cross-sectional analysis to cover time series regressions. 


10-3a Unbiasedness of OLS 


The first assumption simply states that the time series process follows a model that is linear in its 
parameters. 


Assumption TS.1 Linear in Parameters 


The stochastic process $\{(x_{t1}, x_{t2}, \ldots, x_{tk}, y_t): t = 1, 2, \ldots, n\}$ follows the linear model


$y_t = \beta_0 + \beta_1 x_{t1} + \cdots + \beta_k x_{tk} + u_t,$ [10.8]


where $\{u_t: t = 1, 2, \ldots, n\}$ is the sequence of errors or disturbances. Here, n is the number of
observations (time periods). 


In the notation $x_{tj}$, t denotes the time period, and j is, as usual, a label to indicate one of the
k explanatory variables. The terminology used in cross-sectional regression applies here: $y_t$ is the
dependent variable, explained variable, or regressand; the $x_{tj}$ are the independent variables,
explanatory variables, or regressors.

We should think of Assumption TS.1 as being essentially the same as Assumption MLR.1 (the 
first cross-sectional assumption), but we are now specifying a linear model for time series data. The 
examples covered in Section 10-2 can be cast in the form of (10.8) by appropriately defining $x_{tj}$. For
example, equation (10.5) is obtained by setting $x_{t1} = z_t$, $x_{t2} = z_{t-1}$, and $x_{t3} = z_{t-2}$.

To state and discuss several of the remaining assumptions, we let $\mathbf{x}_t = (x_{t1}, x_{t2}, \ldots, x_{tk})$ denote
the set of all independent variables in the equation at time t. Further, X denotes the collection of all
independent variables for all time periods. It is useful to think of X as being an array, with n rows
and k columns. This reflects how time series data are stored in econometric software packages: the
t-th row of X is $\mathbf{x}_t$, consisting of all independent variables for time period t. Therefore, the first row of
X corresponds to t = 1, the second row to t = 2, and the last row to t = n. An example is given in 
Table 10.2, using n = 8 and the explanatory variables in equation (10.3). 


TABLE 10.2 Example of X for the Explanatory Variables in Equation (10.3) 


t    convrte    unem    yngmle
1      .46      .074     .12
2      .42      .071     .12
3      .42      .063     .11
4      .47      .062     .09
5      .48      .060     .10
6      .50      .059     .11
7      .55      .058     .12
8      .56      .059     .13


Naturally, as with cross-sectional regression, we need to rule out perfect collinearity among the
regressors.


Assumption TS.2 No Perfect Collinearity 


In the sample (and therefore in the underlying time series process), no independent variable is constant 
nor a perfect linear combination of the others. 


We discussed this assumption at length in the context of cross-sectional data in Chapter 3. The 
issues are essentially the same with time series data. Remember, Assumption TS.2 does allow the 
explanatory variables to be correlated, but it rules out perfect correlation in the sample. 
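
As a concrete illustration of X as an array and of what Assumption TS.2 rules out, the sketch below stores the values from Table 10.2 with one row per time period and checks the column rank after adding an intercept. The rank check is one convenient way to detect perfect collinearity in a sample, and the extra collinear regressor is a hypothetical construction added only to show a violation.

```python
import numpy as np

# Table 10.2: columns are convrte, unem, yngmle; row t holds x_t.
X = np.array([
    [0.46, 0.074, 0.12],
    [0.42, 0.071, 0.12],
    [0.42, 0.063, 0.11],
    [0.47, 0.062, 0.09],
    [0.48, 0.060, 0.10],
    [0.50, 0.059, 0.11],
    [0.55, 0.058, 0.12],
    [0.56, 0.059, 0.13],
])
n, k = X.shape                               # n = 8 time periods, k = 3 regressors
X_const = np.column_stack([np.ones(n), X])   # add the intercept column

# TS.2 holds in the sample when the matrix with the intercept has full column rank.
full_rank = np.linalg.matrix_rank(X_const) == k + 1

# Hypothetical violation: a regressor that is an exact multiple of convrte.
X_bad = np.column_stack([X_const, 2.0 * X[:, 0]])
bad_rank = np.linalg.matrix_rank(X_bad)      # rank stays at k + 1 despite k + 2 columns
print(full_rank, bad_rank)
```

The regressors in the table are correlated with one another, which TS.2 allows; only the artificially added column makes the rank fall short of the number of columns.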

The final assumption for unbiasedness of OLS is the time series analog of Assumption MLR.4, 
and it also obviates the need for random sampling in Assumption MLR.2.


Assumption TS.3 Zero Conditional Mean 


For each t, the expected value of the error $u_t$, given the explanatory variables for all time periods, is
zero. Mathematically,


$E(u_t \mid \mathbf{X}) = 0, \quad t = 1, 2, \ldots, n.$ [10.9]


This is a crucial assumption, and we need to have an intuitive grasp of its meaning. As in the
cross-sectional case, it is easiest to view this assumption in terms of uncorrelatedness: Assumption TS.3
implies that the error at time t, $u_t$, is uncorrelated with each explanatory variable in every time period.
The fact that this is stated in terms of the conditional expectation means that we must also correctly
specify the functional relationship between $y_t$ and the explanatory variables. If $u_t$ is independent of X
and $E(u_t) = 0$, then Assumption TS.3 automatically holds.

Given the cross-sectional analysis from Chapter 3, it is not surprising that we require $u_t$ to be
uncorrelated with the explanatory variables also dated at time t: in conditional mean terms,


$E(u_t \mid x_{t1}, \ldots, x_{tk}) = E(u_t \mid \mathbf{x}_t) = 0.$ [10.10]


When (10.10) holds, we say that the $x_{tj}$ are contemporaneously exogenous. Equation (10.10) implies
that $u_t$ and the explanatory variables are contemporaneously uncorrelated: $Corr(x_{tj}, u_t) = 0$, for all j.

Assumption TS.3 requires more than contemporaneous exogeneity: $u_t$ must be uncorrelated with
$x_{sj}$, even when $s \neq t$. This is a strong sense in which the explanatory variables must be exogenous, and
when TS.3 holds, we say that the explanatory variables are strictly exogenous. In Chapter 11, we will 
demonstrate that (10.10) is sufficient for proving consistency of the OLS estimator. But to show that 
OLS is unbiased, we need the strict exogeneity assumption. 

In the cross-sectional case, we did not explicitly state how the error term for, say, person i, $u_i$,
is related to the explanatory variables for other people in the sample. This was unnecessary because
with random sampling (Assumption MLR.2), $u_i$ is automatically independent of the explanatory
variables for observations other than i. In a time series context, random sampling is almost never
appropriate, so we must explicitly assume that the expected value of $u_t$ is not related to the explanatory
variables in any time periods.

It is important to see that Assumption TS.3 puts no restriction on correlation in the independent 
variables or in the $u_t$ across time. Assumption TS.3 only says that the average value of $u_t$ is unrelated
to the independent variables in all time periods. 


Anything that causes the unobservables at time t to be correlated with any of the explanatory
variables in any time period causes Assumption TS.3 to fail. Two leading candidates for failure are 
omitted variables and measurement error in some of the regressors. But the strict exogeneity assump- 
tion can also fail for other, less obvious reasons. In the simple static regression model 


$y_t = \beta_0 + \beta_1 z_t + u_t,$


Assumption TS.3 requires not only that $u_t$ and $z_t$ are uncorrelated, but that $u_t$ is also uncorrelated with
past and future values of z. This has two implications. First, z can have no lagged effect on y. If z 
does have a lagged effect on y, then we should estimate a distributed lag model. A more subtle point 
is that strict exogeneity excludes the possibility that changes in the error term today can cause future 
changes in z. This effectively rules out feedback from y to future values of z. For example, consider a 
simple static model to explain a city’s murder rate in terms of police officers per capita: 


$mrdrte_t = \beta_0 + \beta_1 polpc_t + u_t.$


It may be reasonable to assume that $u_t$ is uncorrelated with $polpc_t$ and even with past values of polpc;
for the sake of argument, assume this is the case. But suppose that the city adjusts the size of its police
force based on past values of the murder rate. This means that, say, $polpc_{t+1}$ might be correlated
with $u_t$ (because a higher $u_t$ leads to a higher $mrdrte_t$). If this is the case, Assumption TS.3 is generally
violated. 
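
A small Monte Carlo experiment can show why feedback of this kind matters for unbiasedness. The design below is entirely hypothetical (it is not the murder rate example itself): the regressor in period t + 1 either reacts to the current error, which is the feedback case, or reacts to an unrelated noise series. In both cases the error is independent of the contemporaneous regressor, so only strict exogeneity differs. The function name `mean_ols_slope`, the parameter values, and the very short sample length are assumptions chosen so that any finite sample bias is visible.

```python
import numpy as np

def mean_ols_slope(feedback, n=10, reps=20000, beta0=1.0, beta1=2.0, seed=0):
    """Average OLS slope over many short samples of y_t = beta0 + beta1*z_t + u_t.

    z_t = driver_{t-1} + e_t. With feedback=True the driver is the error u
    itself, so z_{t+1} reacts to u_t; otherwise the driver is independent noise.
    """
    rng = np.random.default_rng(seed)
    u = rng.normal(size=(reps, n + 1))          # u_0, ..., u_n in each replication
    e = rng.normal(size=(reps, n))
    driver = u if feedback else rng.normal(size=(reps, n + 1))
    z = driver[:, :-1] + e                      # z_t depends on driver_{t-1}
    y = beta0 + beta1 * z + u[:, 1:]            # u_t is independent of z_t in both cases
    zc = z - z.mean(axis=1, keepdims=True)
    yc = y - y.mean(axis=1, keepdims=True)
    slopes = (zc * yc).sum(axis=1) / (zc ** 2).sum(axis=1)
    return slopes.mean()

print(mean_ols_slope(feedback=False))   # strictly exogenous regressor
print(mean_ols_slope(feedback=True))    # feedback from u_t to z_{t+1}
```

Without feedback the average slope sits very close to the true value of 2. With feedback it falls below 2 in these short samples, even though the regressor is contemporaneously exogenous. The bias in this design shrinks as n grows, which is consistent with contemporaneous exogeneity being enough for consistency but not for unbiasedness.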

There are similar considerations in distributed lag models. Usually, we do not worry that $u_t$ might
be correlated with past z because we are controlling for past z in the model. But feedback from u to 
future z is always an issue. 

Explanatory variables that are strictly exogenous cannot react to what has happened to y in 
the past. A factor such as the amount of rainfall in an agricultural production function satisfies this 
requirement: rainfall in any future year is not influenced by the output during the current or past 
years. But something like the amount of labor input might not be strictly exogenous, as it is chosen by 
the farmer, and the farmer may adjust the amount of labor based on last year’s yield. Policy variables, 
such as growth in the money supply, expenditures on welfare, and highway speed limits, are often 
influenced by what has happened to the outcome variable in the past. In the social sciences, many 
explanatory variables may very well violate the strict exogeneity assumption. 

Even though Assumption TS.3 can be unrealistic, we begin with it in order to conclude that the OLS 
estimators are unbiased. Most treatments of static and FDL models assume TS.3 by making the stronger 
assumption that the explanatory variables are nonrandom, or fixed in repeated samples. The
nonrandomness assumption is obviously false for time series observations; Assumption TS.3 has the advantage of
being more realistic about the random nature of the $x_{tj}$, while it isolates the necessary assumption about
how $u_t$ and the explanatory variables are related in order for OLS to be unbiased.


THEOREM 10.1 UNBIASEDNESS OF OLS


Under Assumptions TS.1, TS.2, and TS.3, the OLS estimators are unbiased conditional on X, and
therefore unconditionally as well when the expectations exist: $E(\hat{\beta}_j) = \beta_j$, $j = 0, 1, \ldots, k$.


The proof of this theorem is essentially the same as that for Theorem 3.1 in Chapter 3, and so we omit
it. When comparing Theorem 10.1 to Theorem 3.1, we have been able to drop the random sampling
assumption by assuming that, for each t, $u_t$ has zero mean given the explanatory variables at all time
periods. If this assumption does not hold, OLS cannot be shown to be unbiased.


GOING FURTHER 10.2
In the FDL model $y_t = \alpha_0 + \delta_0 z_t + \delta_1 z_{t-1} + u_t$, what do we need to assume about
the sequence $\{z_0, z_1, \ldots, z_n\}$ in order for Assumption TS.3 to hold?


The analysis of omitted variables bias, which we covered in Section 3-3, is essentially the same
in the time series case. In particular, Table 3.2 and the discussion surrounding it can be used as before
to determine the directions of bias due to omitted variables.


10-3b The Variances of the OLS Estimators and the Gauss-Markov 
Theorem 


We need to add two assumptions to round out the Gauss-Markov assumptions for time series regres- 
sions. The first one is familiar from cross-sectional analysis. 


Assumption TS.4 Homoskedasticity 


Conditional on X, the variance of $u_t$ is the same for all t: $Var(u_t \mid \mathbf{X}) = Var(u_t) = \sigma^2$, $t = 1, 2, \ldots, n$.


This assumption means that $Var(u_t \mid \mathbf{X})$ cannot depend on X (it is sufficient that $u_t$ and X are
independent) and that $Var(u_t)$ is constant over time. When TS.4 does not hold, we say that the errors
are heteroskedastic, just as in the cross-sectional case. For example, consider an equation for
determining three-month T-bill rates ($i3_t$) based on the inflation rate ($inf_t$) and the federal deficit as a
percentage of gross domestic product ($def_t$):


$i3_t = \beta_0 + \beta_1 inf_t + \beta_2 def_t + u_t.$ [10.11]


Among other things, Assumption TS.4 requires that the unobservables affecting interest rates have 
a constant variance over time. Because policy regime changes are known to affect the variability of 
interest rates, this assumption might very well be false. Further, it could be that the variability in inter- 
est rates depends on the level of inflation or relative size of the deficit. This would also violate the 
homoskedasticity assumption. 

When $Var(u_t \mid \mathbf{X})$ does depend on X, it often depends on the explanatory variables at time t, $\mathbf{x}_t$. In
Chapter 12, we will see that the tests for heteroskedasticity from Chapter 8 can also be used for time 
series regressions, at least under certain assumptions. 

The final Gauss-Markov assumption for time series analysis is new. 


Assumption TS.5 No Serial Correlation 


Conditional on X, the errors in two different time periods are uncorrelated: $Corr(u_t, u_s \mid \mathbf{X}) = 0$, for
all $t \neq s$.


The easiest way to think of this assumption is to ignore the conditioning on X. Then, 
Assumption TS.5 is simply 


$Corr(u_t, u_s) = 0$, for all $t \neq s$. [10.12]


(This is how the no serial correlation assumption is stated when X is treated as nonrandom.) When 
considering whether Assumption TS.5 is likely to hold, we focus on equation (10.12) because of its 
simple interpretation. 

When (10.12) is false, we say that the errors in (10.8) suffer from serial correlation, or
autocorrelation, because they are correlated across time. Consider the case of errors from adjacent time
periods. Suppose that when $u_{t-1} > 0$ then, on average, the error in the next time period, $u_t$, is also
positive. Then, $Corr(u_t, u_{t-1}) > 0$, and the errors suffer from serial correlation. In equation (10.11),
this means that if interest rates are unexpectedly high for this period, then they are likely to be above
average (for the given levels of inflation and deficits) for the next period. This turns out to be a
reasonable characterization for the error terms in many time series applications, which we will see in 
Chapter 12. For now, we assume TS.5. 
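
To see what positive serial correlation between adjacent errors looks like in a sample, the sketch below compares a sequence of uncorrelated errors with a sequence in which each error carries over part of the previous one. The AR(1) form, the value of rho, the sample size, and the helper name `first_order_autocorr` are assumptions used only for illustration.

```python
import numpy as np

def first_order_autocorr(u):
    """Sample correlation between u_t and u_{t-1}."""
    return np.corrcoef(u[1:], u[:-1])[0, 1]

rng = np.random.default_rng(0)
n, rho = 5000, 0.7                 # hypothetical sample size and AR(1) parameter
e = rng.normal(size=n)

u_iid = e.copy()                   # errors consistent with (10.12)
u_ar1 = np.zeros(n)                # errors following u_t = rho * u_{t-1} + e_t
for t in range(1, n):
    u_ar1[t] = rho * u_ar1[t - 1] + e[t]

print(first_order_autocorr(u_iid))
print(first_order_autocorr(u_ar1))
```

The first correlation is close to zero, while the second is close to rho: a positive error in one period tends to be followed by a positive error in the next.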

Importantly, Assumption TS.5 assumes nothing about temporal correlation in the independent 
variables. For example, in equation (10.11), $inf_t$ is almost certainly correlated across time. But this has
nothing to do with whether TS.5 holds. 

A natural question that arises is: in Chapters 3 and 4, why did we not assume that the errors for 
different cross-sectional observations are uncorrelated? The answer comes from the random sampling 
assumption: under random sampling, $u_i$ and $u_h$ are independent for any two observations i and h.
It can also be shown that, under random sampling, the errors for different observations are inde- 
pendent conditional on the explanatory variables in the sample. Thus, for our purposes, we consider 
serial correlation only to be a potential problem for regressions with time series data. (In Chapters 13 
and 14, the serial correlation issue will come up in connection with panel data analysis.) 

Assumptions TS.1 through TS.5 are the appropriate Gauss-Markov assumptions for time series 
applications, but they have other uses as well. Sometimes, TS.1 through TS.5 are satisfied in cross- 
sectional applications, even when random sampling is not a reasonable assumption, such as when the 
cross-sectional units are large relative to the population. Suppose that we have a cross-sectional data 
set at the city level. It might be that correlation exists across cities within the same state in some of 
the explanatory variables, such as property tax rates or per capita welfare payments. Correlation of the 
explanatory variables across observations does not cause problems for verifying the Gauss-Markov 
assumptions, provided the error terms are uncorrelated across cities. However, in this chapter, we are 
primarily interested in applying the Gauss-Markov assumptions to time series regression problems. 


THEOREM 10.2 OLS SAMPLING VARIANCES


Under the time series Gauss-Markov Assumptions TS.1 through TS.5, the variance of $\hat{\beta}_j$, conditional
on X, is


$Var(\hat{\beta}_j \mid \mathbf{X}) = \sigma^2 / [SST_j (1 - R_j^2)], \quad j = 1, \ldots, k,$ [10.13]


where $SST_j$ is the total sum of squares of $x_{tj}$ and $R_j^2$ is the R-squared from the regression of $x_j$ on the
other independent variables.


Equation (10.13) is the same variance we derived in Chapter 3 under the cross-sectional Gauss- 
Markov assumptions. Because the proof is very similar to the one for Theorem 3.2, we omit it. The 
discussion from Chapter 3 about the factors causing large variances, including multicollinearity 
among the explanatory variables, applies immediately to the time series case. 
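
The variance formula in (10.13) can be checked numerically against the matrix expression for the conditional variance of OLS, namely the error variance times the diagonal of the inverse of X'X. The regressors and the value of the error variance below are simulated, hypothetical inputs, and `var_from_formula` is an illustrative helper rather than a routine from any package.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma2 = 40, 2.0                       # hypothetical sample size and error variance
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(size=n)        # correlated with x1, so R_j^2 > 0
X = np.column_stack([np.ones(n), x1, x2])

def var_from_formula(X, j, sigma2):
    """sigma^2 / [SST_j (1 - R_j^2)] for the regressor in column j (j >= 1)."""
    xj = X[:, j]
    others = np.delete(X, j, axis=1)      # intercept plus the other regressors
    fitted = others @ np.linalg.lstsq(others, xj, rcond=None)[0]
    sst_j = ((xj - xj.mean()) ** 2).sum()
    r2_j = 1.0 - ((xj - fitted) ** 2).sum() / sst_j
    return sigma2 / (sst_j * (1.0 - r2_j))

formula = np.array([var_from_formula(X, j, sigma2) for j in (1, 2)])
matrix = sigma2 * np.diag(np.linalg.inv(X.T @ X))[1:]
print(formula, matrix)
```

The two computations agree up to rounding error. The formula also makes the role of multicollinearity explicit: as $R_j^2$ approaches one, the denominator shrinks and the variance grows.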

The usual estimator of the error variance is also unbiased under Assumptions TS.1 through TS.5, 
and the Gauss-Markov Theorem holds. 


THEOREM 10.3 UNBIASED ESTIMATION OF $\sigma^2$


Under Assumptions TS.1 through TS.5, the estimator $\hat{\sigma}^2 = SSR/df$ is an unbiased estimator of $\sigma^2$,
where $df = n - k - 1$.


THEOREM 10.4 GAUSS-MARKOV THEOREM


Under Assumptions TS.1 through TS.5, the OLS estimators are the best linear unbiased estimators
conditional on X.


The bottom line here is that OLS has the same desirable finite sample properties under TS.1 through
TS.5 that it has under MLR.1 through MLR.5.


GOING FURTHER 10.3
In the FDL model $y_t = \alpha_0 + \delta_0 z_t + \delta_1 z_{t-1} + u_t$, explain the nature of any multicollinearity
in the explanatory variables.


10-3c Inference under the Classical Linear Model Assumptions 


In order to use the usual OLS standard errors, t statistics, and F statistics, we need to add a final
assumption that is analogous to the normality assumption we used for cross-sectional analysis. 


Assumption TS.6 Normality 


The errors $u_t$ are independent of X and are independently and identically distributed as Normal$(0, \sigma^2)$.


Assumption TS.6 implies TS.3, TS.4, and TS.5, but it is stronger because of the independence
and normality assumptions.


THEOREM 10.5 NORMAL SAMPLING DISTRIBUTIONS


Under Assumptions TS.1 through TS.6, the CLM assumptions for time series, the OLS estimators are 
normally distributed, conditional on X. Further, under the null hypothesis, each t statistic has a t distri- 
bution, and each F statistic has an F distribution. The usual construction of confidence intervals is also 
valid. 


The implications of Theorem 10.5 are of utmost importance. It implies that, when Assumptions 
TS.1 through TS.6 hold, everything we have learned about estimation and inference for cross-sectional 
regressions applies directly to time series regressions. Thus, t statistics can be used for testing statistical
significance of individual explanatory variables, and F statistics can be used to test for joint significance. 

Just as in the cross-sectional case, the usual inference procedures are only as good as the underlying 
assumptions. The classical linear model assumptions for time series data are much more restrictive than 
those for cross-sectional data—in particular, the strict exogeneity and no serial correlation assumptions 
can be unrealistic. Nevertheless, the CLM framework is a good starting point for many applications. 


Static Phillips Curve 


To determine whether there is a tradeoff, on average, between unemployment and inflation, we can 
test $H_0\colon \beta_1 = 0$ against $H_1\colon \beta_1 < 0$ in equation (10.2). If the classical linear model assumptions hold,
we can use the usual OLS t statistic.

We use the file PHILLIPS to estimate equation (10.2), restricting ourselves to the data through 
2006. (In later exercises, for example, Computer Exercise C12 and Computer Exercise C10 in
Chapter 11, you are asked to use all years through 2017. In Chapter 18, we use the years 2007 through
2017 in various forecasting exercises.) The simple regression estimates are 


$\widehat{inf}_t = 1.01 + .505\, unem_t$
          (1.49)   (.257) [10.14]
$n = 59,\ R^2 = .065,\ \bar{R}^2 = .049.$


This equation does not suggest a tradeoff between unem and inf: $\hat{\beta}_1 > 0$. The t statistic for $\hat{\beta}_1$ is about
1.96, which gives a p-value against a two-sided alternative of about .055. Thus, if anything, there is a
positive relationship between inflation and unemployment. 
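
The t statistic and two-sided p-value quoted for (10.14) can be reproduced from the reported coefficient and standard error. The sketch below avoids any statistics library and integrates the t density numerically with Simpson's rule. The function name `t_two_sided_pvalue` and the number of integration steps are illustrative choices, and the calculation presumes the CLM assumptions, so that the statistic has a t distribution with n - k - 1 = 57 degrees of freedom.

```python
import math

def t_two_sided_pvalue(t_stat, df, steps=20000):
    """Two-sided p-value for a t statistic, using Simpson's rule on the t density."""
    log_const = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))

    def pdf(x):
        return math.exp(log_const - (df + 1) / 2 * math.log1p(x * x / df))

    b = abs(t_stat)
    h = b / steps                      # steps must be even for Simpson's rule
    area = pdf(0.0) + pdf(b)
    for i in range(1, steps):
        area += (4 if i % 2 else 2) * pdf(i * h)
    area *= h / 3                      # P(0 < T < |t|)
    return 2 * (0.5 - area)

beta1_hat, se = 0.505, 0.257           # estimate and standard error reported in (10.14)
n, k = 59, 1
t_stat = beta1_hat / se
p_value = t_two_sided_pvalue(t_stat, df=n - k - 1)
print(round(t_stat, 2), round(p_value, 3))
```

The p-value lands just above .05, in line with the value of about .055 reported in the text. Against the one-sided alternative that the slope is negative, a positive t statistic of this size provides no evidence of a tradeoff.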

There are some problems with this analysis that we cannot address in detail now. In Chapter 12, 
we will see that the CLM assumptions do not hold. In addition, the static Phillips curve is probably 
not the best model for determining whether there is a short-run tradeoff between inflation and unem- 
ployment. Macroeconomists generally prefer the expectations augmented Phillips curve, a simple 
example of which is given in Chapter 11. 


As a second example, we estimate equation (10.11) using annual data on the U.S. economy. 


Effects of Inflation and Deficits on Interest Rates 


The data in INTDEF come from the 2004 Economic Report of the President (Tables B-73 and B-79) 
and span the years 1948 through 2003. The variable i3 is the three-month T-bill rate, inf is the annual 
inflation rate based on the consumer price index (CPI), and def is the federal budget deficit as a per- 
centage of GDP. The estimated equation is 


$\widehat{i3}_t = 1.73 + .606\, inf_t + .513\, def_t$
         (0.43)  (.082)       (.118) [10.15]
$n = 56,\ R^2 = .602,\ \bar{R}^2 = .587.$


These estimates show that increases in inflation or the relative size of the deficit increase short-term 
interest rates, both of which are expected from basic economics. For example, a ceteris paribus one 
percentage point increase in the inflation rate increases i3 by .606 points. Both inf and def are very 
statistically significant, assuming, of course, that the CLM assumptions hold. 


10-4 Functional Form, Dummy Variables, and Index Numbers 


All of the functional forms we learned about in earlier chapters can be used in time series regressions. 
The most important of these is the natural logarithm: time series regressions with constant percentage 
effects appear often in applied work. 


EXAMPLE 10.3 Puerto Rican Employment and the Minimum Wage


Annual data on the Puerto Rican employment rate, minimum wage, and other variables are used by 
Castillo-Freeman and Freeman (1992) to study the effects of the U.S. minimum wage on employment 
in Puerto Rico. A simplified version of their model is 


$$\log(\mathit{prepop}_t) = \beta_0 + \beta_1 \log(\mathit{mincov}_t) + \beta_2 \log(\mathit{usgnp}_t) + u_t, \tag{10.16}$$


where $\mathit{prepop}_t$ is the employment rate in Puerto Rico during year t (ratio of those working to total
population), $\mathit{usgnp}_t$ is real U.S. gross national product (in billions of dollars), and mincov
measures the importance of the minimum wage relative to average wages. In particular,
mincov = (avgmin/avgwage)·avgcov, where avgmin is the average minimum wage, avgwage is the average
overall wage, and avgcov is the average coverage rate (the proportion of workers actually covered by
the minimum wage law).

Using the data in PRMINWGE for the years 1950 through 1987 gives

$$\widehat{\log(\mathit{prepop}_t)} = \underset{(0.77)}{-1.05} - \underset{(.065)}{.154}\,\log(\mathit{mincov}_t) - \underset{(.089)}{.012}\,\log(\mathit{usgnp}_t) \tag{10.17}$$
$$n = 38,\quad R^2 = .661,\quad \bar{R}^2 = .641.$$


The estimated elasticity of prepop with respect to mincov is -.154, and it is statistically significant
with t = -2.37. Therefore, a higher minimum wage lowers the employment rate, something that
classical economics predicts. The GNP variable is not statistically significant, but this changes when 
we account for a time trend in the next section. 


We can use logarithmic functional forms in distributed lag models, too. For example, for quarterly
data, suppose that money demand ($M_t$) and gross domestic product ($\mathit{GDP}_t$) are related by

$$\log(M_t) = \alpha_0 + \delta_0 \log(\mathit{GDP}_t) + \delta_1 \log(\mathit{GDP}_{t-1}) + \delta_2 \log(\mathit{GDP}_{t-2}) + \delta_3 \log(\mathit{GDP}_{t-3}) + \delta_4 \log(\mathit{GDP}_{t-4}) + u_t.$$

The impact propensity in this equation, $\delta_0$, is also called the short-run elasticity: it measures
the immediate percentage change in money demand given a 1% increase in GDP. The LRP,
$\delta_0 + \delta_1 + \cdots + \delta_4$, is sometimes called the long-run elasticity: it measures the percentage increase
in money demand after four quarters given a permanent 1% increase in GDP.

Binary or dummy independent variables are also quite useful in time series applications. Because 
the unit of observation is time, a dummy variable represents whether, in each time period, a certain 
event has occurred. For example, for annual data, we can indicate in each year whether a Democrat 
or a Republican is president of the United States by defining a variable $\mathit{democ}_t$, which is unity if the
president is a Democrat, and zero otherwise. Or, in looking at the effects of capital punishment on 
murder rates in Texas, we can define a dummy variable for each year equal to one if Texas had capital 
punishment during that year, and zero otherwise. 

Often, dummy variables are used to isolate certain periods that may be systematically different 
from other periods covered by a data set. 


EXAMPLE 10.4 Effects of Personal Exemption on Fertility Rates 


The general fertility rate (gfr) is the number of children born to every 1,000 women of childbearing 
age. For the years 1913 through 1984, the equation, 


$$\mathit{gfr}_t = \beta_0 + \beta_1 \mathit{pe}_t + \beta_2 \mathit{ww2}_t + \beta_3 \mathit{pill}_t + u_t,$$

explains gfr in terms of the average real dollar value of the personal tax exemption (pe) and two 
binary variables. The variable ww2 takes on the value unity during the years 1941 through 1945, when 
the United States was involved in World War II. The variable pill is unity from 1963 onward, when the 
birth control pill was made available for contraception. 
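Using the data in FERTIL3, which were taken from the article by Whittington, Alm, and Peters
(1990), gives

$$\widehat{\mathit{gfr}}_t = \underset{(3.21)}{98.68} + \underset{(.030)}{.083}\,\mathit{pe}_t - \underset{(7.46)}{24.24}\,\mathit{ww2}_t - \underset{(4.08)}{31.59}\,\mathit{pill}_t \tag{10.18}$$
$$n = 72,\quad R^2 = .473,\quad \bar{R}^2 = .450.$$
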
Each variable is statistically significant at the 1% level against a two-sided alternative. We see that the
fertility rate was lower during World War II: given pe, there were about 24 fewer births for every 1,000
women of childbearing age, which is a large reduction. (From 1913 through 1984, gfr ranged from about 65 to
127.) Similarly, the fertility rate has been substantially lower since the introduction of the birth control pill.

The variable of economic interest is pe. The average pe over this time period is $100.40, ranging
from zero to $243.83. The coefficient on pe implies that a $12.00 increase in pe increases gfr by about
one birth per 1,000 women of childbearing age. This effect is hardly trivial.

In Section 10-2, we noted that the fertility rate may react to changes in pe with a lag. Estimating 
a distributed lag model with two lags gives 


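$$
\begin{aligned}
\widehat{\mathit{gfr}}_t = {} & \underset{(3.28)}{95.87} + \underset{(.126)}{.073}\,\mathit{pe}_t - \underset{(.1557)}{.0058}\,\mathit{pe}_{t-1} + \underset{(.126)}{.034}\,\mathit{pe}_{t-2} \\
& - \underset{(10.73)}{22.12}\,\mathit{ww2}_t - \underset{(3.98)}{31.30}\,\mathit{pill}_t
\end{aligned}
\tag{10.19}
$$
$$n = 70,\quad R^2 = .499,\quad \bar{R}^2 = .459.$$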


In this regression, we only have 70 observations because we lose two when we lag pe twice. The
coefficients on the pe variables are estimated very imprecisely, and each one is individually insignificant.
It turns out that there is substantial correlation between $\mathit{pe}_t$, $\mathit{pe}_{t-1}$, and $\mathit{pe}_{t-2}$, and this multicollinearity
makes it difficult to estimate the effect at each lag. However, $\mathit{pe}_t$, $\mathit{pe}_{t-1}$, and $\mathit{pe}_{t-2}$ are jointly
significant: the F statistic has a p-value = .012. Thus, pe does have an effect on gfr [as we already saw in
(10.18)], but we do not have good enough estimates to determine whether it is contemporaneous or
with a one- or two-year lag (or some of each). Actually, $\mathit{pe}_{t-1}$ and $\mathit{pe}_{t-2}$ are jointly insignificant in
this equation (p-value = .95), so at this point, we would be justified in using the static model. But for
illustrative purposes, let us obtain a confidence interval for the LRP in this model.

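The estimated LRP in (10.19) is $.073 - .0058 + .034 \approx .101$. However, we do not have enough
information in (10.19) to obtain the standard error of this estimate. To obtain the standard error of the
estimated LRP, we use the trick suggested in Section 4-4. Let $\theta_0 = \delta_0 + \delta_1 + \delta_2$ denote the LRP and
write $\delta_0$ in terms of $\theta_0$, $\delta_1$, and $\delta_2$ as $\delta_0 = \theta_0 - \delta_1 - \delta_2$. Next, substitute for $\delta_0$ in the model

$$\mathit{gfr}_t = \alpha_0 + \delta_0 \mathit{pe}_t + \delta_1 \mathit{pe}_{t-1} + \delta_2 \mathit{pe}_{t-2} + \cdots$$

to get

$$
\begin{aligned}
\mathit{gfr}_t &= \alpha_0 + (\theta_0 - \delta_1 - \delta_2)\mathit{pe}_t + \delta_1 \mathit{pe}_{t-1} + \delta_2 \mathit{pe}_{t-2} + \cdots \\
&= \alpha_0 + \theta_0 \mathit{pe}_t + \delta_1(\mathit{pe}_{t-1} - \mathit{pe}_t) + \delta_2(\mathit{pe}_{t-2} - \mathit{pe}_t) + \cdots.
\end{aligned}
$$

From this last equation, we can obtain $\hat{\theta}_0$ and its standard error by regressing $\mathit{gfr}_t$ on $\mathit{pe}_t$, $(\mathit{pe}_{t-1} - \mathit{pe}_t)$,
$(\mathit{pe}_{t-2} - \mathit{pe}_t)$, $\mathit{ww2}_t$, and $\mathit{pill}_t$. The coefficient and associated standard error on $\mathit{pe}_t$ are what we
need. Running this regression gives $\hat{\theta}_0 = .101$ as the coefficient on $\mathit{pe}_t$ (as we already knew) and
$\mathrm{se}(\hat{\theta}_0) = .030$ [which we could not compute from (10.19)]. Therefore, the t statistic for $\hat{\theta}_0$ is about
3.37, so $\hat{\theta}_0$ is statistically different from zero at small significance levels. Even though none of the $\hat{\delta}_j$
is individually significant, the LRP is very significant. The 95% confidence interval for the LRP is
about .041 to .160.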
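
The reparameterization can be checked numerically. The following Python sketch uses simulated data (not the FERTIL3 data); the sample size, the coefficient values, and the helper name ols are assumptions chosen only for illustration. It shows that the coefficient on the current value in the transformed regression equals the sum of the lag coefficients from the original regression, and that the transformed regression also delivers a standard error for that sum.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
pe = rng.normal(100.0, 30.0, size=n + 2)   # simulated regressor, NOT the FERTIL3 data

# Current value and two lags, aligned on the same n periods.
pe_t, pe_l1, pe_l2 = pe[2:], pe[1:-1], pe[:-2]
# Hypothetical lag coefficients (delta_0, delta_1, delta_2) used to generate y.
y = 95.0 + 0.07 * pe_t - 0.01 * pe_l1 + 0.03 * pe_l2 + rng.normal(0.0, 5.0, size=n)


def ols(X, y):
    """Return OLS coefficients and their usual (homoskedastic) standard errors."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = resid @ resid / (X.shape[0] - X.shape[1])
    return beta, np.sqrt(sigma2 * np.diag(XtX_inv))


ones = np.ones(n)

# Original parameterization: slopes are delta_0, delta_1, delta_2.
b_orig, _ = ols(np.column_stack([ones, pe_t, pe_l1, pe_l2]), y)
lrp_from_sum = b_orig[1] + b_orig[2] + b_orig[3]

# Transformed regression: the coefficient on pe_t is theta_0, the LRP.
b_rep, se_rep = ols(np.column_stack([ones, pe_t, pe_l1 - pe_t, pe_l2 - pe_t]), y)
theta0_hat, se_theta0 = b_rep[1], se_rep[1]

print(lrp_from_sum, theta0_hat, se_theta0)
```

The two point estimates agree up to rounding error because the transformed regressors span the same column space as the original ones; only the second regression reports the standard error of the LRP directly.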


Whittington, Alm, and Peters (1990) allow for further lags but restrict the coefficients to help 
alleviate the multicollinearity problem that hinders estimation of the individual $\delta_j$. (See Problem 6 for
an example of how to do this.) For estimating the LRP, which would seem to be of primary interest 
here, such restrictions are unnecessary. Whittington, Alm, and Peters also control for additional vari- 
ables, such as average female wage and the unemployment rate. 


Binary explanatory variables are the key component in what is called an event study. In an event 
study, the goal is to see whether a particular event influences some outcome. Economists who study 
industrial organization have looked at the effects of certain events on firm stock prices. For example, 
Rose (1985) studied the effects of new trucking regulations on the stock prices of trucking companies. 

A simple version of an equation used for event studies is 


$$R^f_t = \beta_0 + \beta_1 R^m_t + \beta_2 d_t + u_t,$$

where $R^f_t$ is the stock return for firm f during period t (usually a week or a month), $R^m_t$ is the market
return (usually computed for a broad stock market index), and $d_t$ is a dummy variable indicating when
the event occurred. For example, if the firm is an airline, $d_t$ might denote whether the airline
experienced a publicized accident or near accident during week t. Including $R^m_t$ in the equation controls for
the possibility that broad market movements might coincide with airline accidents. Sometimes,
multiple dummy variables are used. For example, if the event is the imposition of a new regulation that
might affect a certain firm, we might include a dummy variable that is one for a few weeks before the
regulation was publicly announced and a second dummy variable for a few weeks after the regulation
was announced. The first dummy variable might detect the presence of inside information.

Before we give an example of an event study, we need to discuss the notion of an index num- 
ber and the difference between nominal and real economic variables. An index number typically 
aggregates a vast amount of information into a single quantity. Index numbers are used regularly 
in time series analysis, especially in macroeconomic applications. An example of an index num- 
ber is the index of industrial production (IIP), computed monthly by the Board of Governors of 
the Federal Reserve. The IIP is a measure of production across a broad range of industries, and, 
as such, its magnitude in a particular year has no quantitative meaning. In order to interpret the 
magnitude of the IIP, we must know the base period and the base value. In the 1997 Economic 
Report of the President (ERP), the base year is 1987, and the base value is 100. (Setting IIP to 100 
in the base period is just a convention; it makes just as much sense to set IIP = 1 in 1987, and some 
indexes are defined with 1 as the base value.) Because the IIP was 107.7 in 1992, we can say that 
industrial production was 7.7% higher in 1992 than in 1987. We can use the IIP in any two years 
to compute the percentage difference in industrial output during those two years. For example, 
because IIP = 61.4 in 1970 and IIP = 85.7 in 1979, industrial production grew by about 39.6%
during the 1970s. 

It is easy to change the base period for any index number, and sometimes we must do this to give 
index numbers reported with different base years a common base year. For example, if we want to 
change the base year of the IIP from 1987 to 1982, we simply divide the IIP for each year by the 1982 
value and then multiply by 100 to make the base period value 100. Generally, the formula is 


$$\mathit{newindex}_t = 100(\mathit{oldindex}_t/\mathit{oldindex}_{\mathit{newbase}}), \tag{10.20}$$

where $\mathit{oldindex}_{\mathit{newbase}}$ is the original value of the index in the new base year. For example, with base
year 1987, the IIP in 1992 is 107.7; if we change the base year to 1982, the IIP in 1992 becomes 
100(107.7/81.9) = 131.5 (because the IIP in 1982 was 81.9). 
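
The index arithmetic above is easy to reproduce. The sketch below applies formula (10.20) to the three IIP values quoted in the text and recomputes the two percentage changes; the function names rebase and pct_change are our own illustrative choices, not part of any standard library.

```python
def rebase(index_by_year, new_base_year):
    """Apply (10.20): divide every value by the new base-year value, times 100."""
    base_value = index_by_year[new_base_year]
    return {year: 100.0 * value / base_value
            for year, value in index_by_year.items()}


def pct_change(old, new):
    """Percentage change from old to new."""
    return 100.0 * (new - old) / old


# IIP values quoted in the text, with base year 1987 (base value 100).
iip_1987_base = {1982: 81.9, 1987: 100.0, 1992: 107.7}
iip_1982_base = rebase(iip_1987_base, 1982)

print(round(iip_1982_base[1992], 1))      # 131.5, the IIP in 1992 with base year 1982
print(round(pct_change(61.4, 85.7), 1))   # 39.6, growth in the IIP from 1970 to 1979
print(round(pct_change(38.8, 130.7), 1))  # 236.9, growth in the CPI from 1970 to 1990
```

Rebasing changes the level of every observation by the same factor, so percentage changes between any two years are the same under either base year.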

Another important example of an index number is a price index, such as the CPI. We already 
used the CPI to compute annual inflation rates in Example 10.1. As with the industrial production 
index, the CPI is only meaningful when we compare it across different years (or months, if we are 
using monthly data). In the 1997 ERP, CPI = 38.8 in 1970 and CPI = 130.7 in 1990. Thus, the gen- 
eral price level grew by almost 237% over this 20-year period. (In 1997, the CPI is defined so that its 
average in 1982, 1983, and 1984 equals 100; thus, the base period is listed as 1982-1984.) 

In addition to being used to compute inflation rates, price indexes are necessary for turning a time 
series measured in nominal dollars (or current dollars) into real dollars (or constant dollars). Most 
economic behavior is assumed to be influenced by real, not nominal, variables. For example, classical 
labor economics assumes that labor supply is based on the real hourly wage, not the nominal wage. 
Obtaining the real wage from the nominal wage is easy if we have a price index such as the CPI. We 
must be a little careful to first divide the CPI by 100, so that the value in the base year is 1. Then, 
if w denotes the average hourly wage in nominal dollars and p = CPI/100, the real wage is simply 
w/p. This wage is measured in dollars for the base period of the CPI. For example, in Table B-45 in 
the 1997 ERP, average hourly earnings are reported in nominal terms and in 1982 dollars (which 
means that the CPI used in computing the real wage had the base year 1982). This table reports that 
the nominal hourly wage in 1960 was $2.09, but measured in 1982 dollars, the wage was $6.79. The 
real hourly wage had peaked in 1973, at $8.55 in 1982 dollars, and had fallen to $7.40 by 1995. Thus,
there was a nontrivial decline in real wages over those 22 years. (If we compare nominal wages from
1973 and 1995, we get a very misleading picture: $3.94 in 1973 and $11.44 in 1995. Because the real 
wage fell, the increase in the nominal wage was due entirely to inflation.) 
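
The deflation step can be sketched in a few lines of Python. The text quotes nominal and real wages but not the CPI itself, so the CPI values below are back-calculated from those wages; they are implied values for illustration, not figures taken from the ERP. The function names are hypothetical.

```python
def real_wage(nominal_wage, cpi):
    """Deflate a nominal wage: divide the CPI by 100 first, so the base year equals 1."""
    p = cpi / 100.0
    return nominal_wage / p


def implied_cpi(nominal_wage, real_wage_value):
    """Back out the CPI level implied by a nominal wage and its real counterpart."""
    return 100.0 * nominal_wage / real_wage_value


# Nominal and real (1982 dollars) hourly wages quoted in the text.
nominal = {1973: 3.94, 1995: 11.44}
real = {1973: 8.55, 1995: 7.40}

# Implied CPI levels (an assumption derived from the wages above, not ERP data).
cpi = {year: implied_cpi(nominal[year], real[year]) for year in nominal}

nominal_growth = 100.0 * (nominal[1995] - nominal[1973]) / nominal[1973]
real_growth = 100.0 * (real[1995] - real[1973]) / real[1973]

print(round(nominal_growth, 1))  # 190.4: nominal wages nearly tripled
print(round(real_growth, 1))     # -13.5: real wages fell
```

The contrast between the two growth rates is the point of the comparison in the text: the whole nominal increase, and more, is accounted for by the rise in the price level.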

Standard measures of economic output are in real terms. The most important of these is gross 
domestic product, or GDP. When growth in GDP is reported in the popular press, it is always real 
GDP growth. In the 2012 ERP, Table B-2, GDP is reported in billions of 2005 dollars. We used a 
similar measure of output, real gross national product, in Example 10.3. 

Interesting things happen when real dollar variables are used in combination with natural loga- 
rithms. Suppose, for example, that average weekly hours worked are related to the real wage as 


$$\log(\mathit{hours}) = \beta_0 + \beta_1 \log(w/p) + u.$$

Using the fact that $\log(w/p) = \log(w) - \log(p)$, we can write this as

$$\log(\mathit{hours}) = \beta_0 + \beta_1 \log(w) + \beta_2 \log(p) + u, \tag{10.21}$$

but with the restriction that $\beta_2 = -\beta_1$. Therefore, the assumption that only the real wage influences
labor supply imposes a restriction on the parameters of model (10.21). If $\beta_2 \neq -\beta_1$, then the price
level has an effect on labor supply, something that can happen if workers do not fully understand the 
distinction between real and nominal wages. 

There are many practical aspects to the actual computation of index numbers, but it would take us 
too far afield to cover those here. Detailed discussions of price indexes can be found in most interme- 
diate macroeconomic texts, such as Mankiw (1994, Chapter 2). For us, it is important to be able to use 
index numbers in regression analysis. As mentioned earlier, because the magnitudes of index numbers 
are not especially informative, they often appear in logarithmic form, so that regression coefficients 
have percentage change interpretations. 

We now give an example of an event study that also uses index numbers. 


EXAMPLE 10.5 Antidumping Filings and Chemical Imports


Krupp and Pollard (1996) analyzed the effects of antidumping filings by U.S. chemical industries on 
imports of various chemicals. We focus here on one industrial chemical, barium chloride, a cleaning 
agent used in various chemical processes and in gasoline production. The data are contained in the 
file BARIUM. In the early 1980s, U.S. barium chloride producers believed that China was offering 
its U.S. imports an unfairly low price (an action known as dumping), and the barium chloride indus- 
try filed a complaint with the U.S. International Trade Commission (ITC) in October 1983. The ITC 
ruled in favor of the U.S. barium chloride industry in October 1984. There are several questions of 
interest in this case, but we will touch on only a few of them. First, were imports unusually high in 
the period immediately preceding the initial filing? Second, did imports change noticeably after an 
antidumping filing? Finally, what was the reduction in imports after a decision in favor of the U.S. 
industry? 

To answer these questions, we follow Krupp and Pollard by defining three dummy variables: 
befile6 is equal to 1 during the six months before filing, affile6 indicates the six months after fil- 
ing, and afdec6 denotes the six months after the positive decision. The dependent variable is the 
volume of imports of barium chloride from China, chnimp, which we use in logarithmic form. We 
include as explanatory variables, all in logarithmic form, an index of chemical production, chempi 
(to control for overall demand for barium chloride), the volume of gasoline production, gas (another 
demand variable), and an exchange rate index, rtwex, which measures the strength of the dollar 
against several other currencies. The chemical production index was defined to be 100 in June 1977. 
The analysis here differs somewhat from Krupp and Pollard in that we use natural logarithms of all 
variables (except the dummy variables, of course), and we include all three dummy variables in the 
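same regression.

Using monthly data from February 1978 through December 1988 gives the following:

$$
\begin{aligned}
\widehat{\log(\mathit{chnimp})} = {} & \underset{(21.05)}{-17.80} + \underset{(.48)}{3.12}\,\log(\mathit{chempi}) + \underset{(.907)}{.196}\,\log(\mathit{gas}) \\
& + \underset{(.400)}{.983}\,\log(\mathit{rtwex}) + \underset{(.261)}{.060}\,\mathit{befile6} - \underset{(.264)}{.032}\,\mathit{affile6} - \underset{(.286)}{.565}\,\mathit{afdec6}
\end{aligned}
\tag{10.22}
$$
$$n = 131,\quad R^2 = .305,\quad \bar{R}^2 = .271.$$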


The equation shows that befile6 is statistically insignificant, so there is no evidence that Chinese
imports were unusually high during the six months before the suit was filed. Further, although the
estimate on affile6 is negative, the coefficient is small (indicating about a 3.2% fall in Chinese imports), and
it is statistically very insignificant. The coefficient on afdec6 shows a substantial fall in Chinese imports
of barium chloride after the decision in favor of the U.S. industry, which is not surprising. Because
the effect is so large, we compute the exact percentage change: $100[\exp(-.565) - 1] \approx -43.2\%$.
The coefficient is statistically significant at the 5% level against a two-sided alternative.
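
The exact percentage change implied by a dummy-variable coefficient in a log-level equation is a one-line calculation. The short Python sketch below applies it to two coefficients from (10.22); the function name exact_pct_change is our own.

```python
import math


def exact_pct_change(coef):
    """Exact % change in y when a dummy switches from 0 to 1 and log(y) is the dependent variable."""
    return 100.0 * (math.exp(coef) - 1.0)


print(round(exact_pct_change(-0.565), 1))  # -43.2, the afdec6 effect
print(round(exact_pct_change(-0.032), 1))  # -3.1, close to the -3.2 approximation for affile6
```

For a small coefficient such as -.032, the approximation 100 times the coefficient is nearly exact; for a coefficient as large as -.565, the approximation (-56.5%) overstates the fall considerably.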

The coefficient signs on the control variables are what we expect: an increase in overall chemical 
production increases the demand for the cleaning agent. Gasoline production does not affect Chinese 
imports significantly. The coefficient on log(rtwex) shows that an increase in the value of the dollar
relative to other currencies increases the demand for Chinese imports, as is predicted by economic 
theory. (In fact, the elasticity is not statistically different from 1. Why?) 


Interactions among qualitative and quantitative variables are also used in time series analysis. An 
example with practical importance follows. 


EXAMPLE 10.6 Election Outcomes and Economic Performance 


Fair (1996) summarizes his work on explaining presidential election outcomes in terms of economic 
performance. He explains the proportion of the two-party vote going to the Democratic candidate 
using data for the years 1916 through 1992 (every four years) for a total of 20 observations. We esti- 
mate a simplified version of Fair’s model (using variable names that are more descriptive than his): 


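$$
\begin{aligned}
\mathit{demvote} = {} & \beta_0 + \beta_1 \mathit{partyWH} + \beta_2 \mathit{incum} + \beta_3 \mathit{partyWH} \cdot \mathit{gnews} \\
& + \beta_4 \mathit{partyWH} \cdot \mathit{inf} + u,
\end{aligned}
$$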


where demvote is the proportion of the two-party vote going to the Democratic candidate. The
explanatory variable partyWH is similar to a dummy variable, but it takes on the value 1 if a Democrat is in
the White House and -1 if a Republican is in the White House. Fair uses this variable to impose the
restriction that the effects of a Republican or a Democrat being in the White House have the same
magnitude but the opposite sign. This is a natural restriction because the party shares must sum to
one, by definition. It also saves two degrees of freedom, which is important with so few
observations. Similarly, the variable incum is defined to be 1 if a Democratic incumbent is running, -1 if a
Republican incumbent is running, and zero otherwise. The variable gnews is the number of quarters,
during the administration's first 15 quarters, when the quarterly growth in real per capita output was
above 2.9% (at an annual rate), and inf is the average annual inflation rate over the first 15 quarters of
the administration. See Fair (1996) for precise definitions.

Economists are most interested in the interaction terms partyWH·gnews and partyWH·inf.
Because partyWH equals 1 when a Democrat is in the White House, $\beta_3$ measures the effect of good
economic news on the party in power; we expect $\beta_3 > 0$. Similarly, $\beta_4$ measures the effect that
inflation has on the party in power. Because inflation during an administration is considered to be bad
news, we expect $\beta_4 < 0$.

The estimated equation using the data in FAIR is

$$
\begin{aligned}
\widehat{\mathit{demvote}} = {} & \underset{(.012)}{.481} - \underset{(.0405)}{.0435}\,\mathit{partyWH} + \underset{(.0234)}{.0544}\,\mathit{incum} \\
& + \underset{(.0041)}{.0108}\,\mathit{partyWH} \cdot \mathit{gnews} - \underset{(.0033)}{.0077}\,\mathit{partyWH} \cdot \mathit{inf}
\end{aligned}
\tag{10.23}
$$
$$n = 20,\quad R^2 = .663,\quad \bar{R}^2 = .573.$$


All coefficients, except that on partyWH, are statistically significant at the 5% level. Incumbency 
is worth about 5.4 percentage points in the share of the vote. (Remember, demvote is measured as a 
proportion.) Further, the economic news variable has a positive effect: one more quarter of good news 
is worth about 1.1 percentage points. Inflation, as expected, has a negative effect: if average annual 
inflation is, say, two percentage points higher, the party in power loses about 1.5 percentage points of 
the two-party vote. 

We could have used this equation to predict the outcome of the 1996 presidential election between 
Bill Clinton, the Democrat, and Bob Dole, the Republican. (The independent candidate, Ross Perot, 
is excluded because Fair’s equation is for the two-party vote only.) Because Clinton ran as an incum- 
bent, partyWH = 1 and incum = 1. To predict the election outcome, we need the variables gnews 
and inf. During Clinton’s first 15 quarters in office, the annual growth rate of per capita real GDP 
exceeded 2.9% three times, so gnews = 3. Further, using the GDP price deflator reported in Table B-4 
in the 1997 ERP, the average annual inflation rate (computed using Fair’s formula) from the fourth 
quarter in 1991 to the third quarter in 1996 was 3.019. Plugging these into (10.23) gives

$$\widehat{\mathit{demvote}} = .481 - .0435 + .0544 + .0108(3) - .0077(3.019) \approx .5011.$$


Therefore, based on information known before the election in November, Clinton was predicted to 
receive a very slight majority of the two-party vote: about 50.1%. In fact, Clinton won more handily: 
his share of the two-party vote was 54.65%. 
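
The prediction is a direct plug-in of the estimated coefficients in (10.23). A minimal Python sketch follows; the dictionary keys and the function name predict_demvote are our own labels, while the coefficient values and the 1996 inputs are the ones given in the text.

```python
# Estimated coefficients from equation (10.23).
COEF = {
    "const": 0.481,
    "partyWH": -0.0435,
    "incum": 0.0544,
    "partyWH_gnews": 0.0108,
    "partyWH_inf": -0.0077,
}


def predict_demvote(partyWH, incum, gnews, inf):
    """Predicted Democratic share of the two-party vote from (10.23)."""
    return (COEF["const"]
            + COEF["partyWH"] * partyWH
            + COEF["incum"] * incum
            + COEF["partyWH_gnews"] * partyWH * gnews
            + COEF["partyWH_inf"] * partyWH * inf)


# 1996: Democratic incumbent, three good-news quarters, average inflation 3.019.
pred_1996 = predict_demvote(partyWH=1, incum=1, gnews=3, inf=3.019)
print(round(pred_1996, 4))  # 0.5011
```

Because partyWH and incum enter with values of 1, -1, or 0, the same function covers a Republican incumbent by passing partyWH=-1 and incum=-1.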


10-5 Trends and Seasonality 


10-5a Characterizing Trending Time Series 


Many economic time series have a common tendency of growing over time. We must recognize that 
some series contain a time trend in order to draw causal inference using time series data. Ignoring the 
fact that two sequences are trending in the same or opposite directions can lead us to falsely conclude 
that changes in one variable are actually caused by changes in another variable. In many cases, two 
time series processes appear to be correlated only because they are both trending over time for rea- 
sons related to other unobserved factors. 

Figure 10.2 contains a plot of labor productivity (output per hour of work) in the United States 
for the years 1947 through 1987. This series displays a clear upward trend, which reflects the fact that 
workers have become more productive over time. 

Other series, at least over certain time periods, have clear downward trends. Because positive 
trends are more common, we will focus on those during our discussion. 

What kind of statistical models adequately capture trending behavior? One popular formulation 
is to write the series $\{y_t\}$ as

$$y_t = \alpha_0 + \alpha_1 t + e_t,\quad t = 1, 2, \ldots, \tag{10.24}$$

where, in the simplest case, $\{e_t\}$ is an independent, identically distributed (i.i.d.) sequence with $E(e_t) = 0$
and $\mathrm{Var}(e_t) = \sigma_e^2$. Note how the parameter $\alpha_1$ multiplies time, t, resulting in a linear time trend.

FIGURE 10.2 Output per labor hour in the United States during the years 1947-1987; 1977 = 100.

Interpreting $\alpha_1$ in (10.24) is simple: holding all other factors (those in $e_t$) fixed, $\alpha_1$ measures the
change in $y_t$ from one period to the next due to the passage of time. We can write this mathematically
by defining the change in $e_t$ from period t-1 to t as $\Delta e_t = e_t - e_{t-1}$. Equation (10.24) implies that if
$\Delta e_t = 0$ then

$$\Delta y_t = y_t - y_{t-1} = \alpha_1.$$

Another way to think about a sequence that has a linear time trend is that its average value is a
linear function of time:

$$E(y_t) = \alpha_0 + \alpha_1 t. \tag{10.25}$$

If $\alpha_1 > 0$, then, on average, $y_t$ is growing over time and therefore has an upward trend. If $\alpha_1 < 0$, then
$y_t$ has a downward trend. The values of $y_t$ do not fall exactly on the line in (10.25) due to randomness,
but the expected values are on the line. Unlike the mean, the variance of $y_t$ is constant across time:
$\mathrm{Var}(y_t) = \mathrm{Var}(e_t) = \sigma_e^2$.

If $\{e_t\}$ is an i.i.d. sequence, then $\{y_t\}$ is an independent, though not identically, distributed sequence.
A more realistic characterization of trending time series allows $\{e_t\}$ to be correlated over time, but this
does not change the flavor of a linear time trend. In fact, what is important for regression analysis under
the classical linear model assumptions is that $E(y_t)$ is linear in t. When we cover large sample properties
of OLS in Chapter 11, we will have to discuss how much temporal correlation in $\{e_t\}$ is allowed.

In Example 10.4, we used the general fertility rate as the dependent variable in an FDL model. From 1950
through the mid-1980s, the gfr has a clear downward trend. Can a linear trend with $\alpha_1 < 0$ be realistic
for all future time periods? Explain.

Many economic time series are better approximated by an exponential trend, which 
follows when a series has the same average growth rate from period to period. Figure 10.3 plots data 
on annual nominal imports for the United States during the years 1948 through 1995 (ERP 1997, 
Table B-101). 

In the early years, we see that the change in imports over each year is relatively small, whereas 
the change increases as time passes. This is consistent with a constant average growth rate: the 
percentage change is roughly the same in each period.

FIGURE 10.3 Nominal U.S. imports during the years 1948-1995 (in billions of U.S. dollars).

In practice, an exponential trend in a time series is captured by modeling the natural logarithm of
the series as a linear trend (assuming that $y_t > 0$):

$$\log(y_t) = \beta_0 + \beta_1 t + e_t,\quad t = 1, 2, \ldots. \tag{10.26}$$

Exponentiating shows that $y_t$ itself has an exponential trend: $y_t = \exp(\beta_0 + \beta_1 t + e_t)$. Because we
will want to use exponentially trending time series in linear regression models, (10.26) turns out to be
the most convenient way for representing such series.

How do we interpret $\beta_1$ in (10.26)? Remember that, for small changes, $\Delta\log(y_t) = \log(y_t) - \log(y_{t-1})$
is approximately the proportionate change in $y_t$:

$$\Delta\log(y_t) \approx (y_t - y_{t-1})/y_{t-1}. \tag{10.27}$$

The right-hand side of (10.27) is also called the growth rate in y from period t-1 to period t. To
turn the growth rate into a percentage, we simply multiply by 100. If $y_t$ follows (10.26), then, taking
changes and setting $\Delta e_t = 0$,

$$\Delta\log(y_t) = \beta_1,\quad \text{for all } t. \tag{10.28}$$

In other words, $\beta_1$ is approximately the average per period growth rate in $y_t$. For example, if t denotes
year and $\beta_1 = .027$, then $y_t$ grows about 2.7% per year on average.
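
The link between the trend coefficient in a log-linear model and the average growth rate can be seen in a small simulation. The Python sketch below generates a series from (10.26) and compares the OLS trend coefficient with the average of the period-to-period growth rates. The intercept, the error standard deviation, the sample length, and the i.i.d. normal errors are all assumptions of the sketch; only the trend coefficient of .027 comes from the example in the text.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 50
t = np.arange(1, T + 1)

beta0, beta1 = 1.0, 0.027            # beta1 matches the 2.7% example; beta0 is arbitrary
e = rng.normal(0.0, 0.02, size=T)    # i.i.d. errors (an assumption of this sketch)
log_y = beta0 + beta1 * t + e        # equation (10.26)
y = np.exp(log_y)                    # y itself follows an exponential trend

# OLS regression of log(y_t) on a constant and t.
X = np.column_stack([np.ones(T), t])
b0_hat, b1_hat = np.linalg.lstsq(X, log_y, rcond=None)[0]

# Average of the growth rates (y_t - y_{t-1}) / y_{t-1}, as in (10.27).
avg_growth = np.mean(np.diff(y) / y[:-1])

print(b1_hat, avg_growth)
```

Both numbers should be close to .027. They are not identical, because (10.27) is an approximation and the errors do not average out exactly in a finite sample.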

Although linear and exponential trends are the most common, time trends can be more compli- 
cated. For example, instead of the linear trend model in (10.24), we might have a quadratic time trend: 


$$y_t = \alpha_0 + \alpha_1 t + \alpha_2 t^2 + e_t. \tag{10.29}$$

If $\alpha_1$ and $\alpha_2$ are positive, then the slope of the trend is increasing, as is easily seen by computing the
approximate slope (holding $e_t$ fixed):

$$\frac{\Delta y_t}{\Delta t} \approx \alpha_1 + 2\alpha_2 t. \tag{10.30}$$

[If you are familiar with calculus, you recognize the right-hand side of (10.30) as the derivative
of $\alpha_0 + \alpha_1 t + \alpha_2 t^2$ with respect to t.] If $\alpha_1 > 0$, but $\alpha_2 < 0$, the trend has a hump shape. This
may not be a very good description of certain trending series because it requires an increasing
trend to be followed, eventually, by a decreasing trend. Nevertheless, over a given time span, it can
be a flexible way of modeling time series that have more complicated trends than either (10.24)
or (10.26).
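
To see the hump shape concretely, the sketch below evaluates the approximate slope in (10.30) and finds the period at which a quadratic trend turns. The parameter values are hypothetical, chosen only so that the turning point is a round number, and the function names are our own.

```python
def trend_slope(alpha1, alpha2, t):
    """Approximate slope of a quadratic trend at time t, as in (10.30)."""
    return alpha1 + 2.0 * alpha2 * t


def turning_point(alpha1, alpha2):
    """Time at which the slope in (10.30) equals zero; requires alpha2 != 0."""
    if alpha2 == 0:
        raise ValueError("a linear trend has no turning point")
    return -alpha1 / (2.0 * alpha2)


# Hypothetical values with alpha1 > 0 and alpha2 < 0 (hump shape).
alpha1, alpha2 = 2.0, -0.05

t_star = turning_point(alpha1, alpha2)
print(round(t_star, 6))                            # period where the trend peaks
print(round(trend_slope(alpha1, alpha2, 10), 6))   # positive slope before the peak
print(round(trend_slope(alpha1, alpha2, 30), 6))   # negative slope after the peak
```

If the turning point lies outside the sample period, the quadratic simply describes a trend whose slope is changing but never switches sign within the data.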


10-5b Using Trending Variables in Regression Analysis 


Accounting for explained or explanatory variables that are trending is fairly straightforward in
regression analysis. First, nothing about trending variables necessarily violates the classical linear model
Assumptions TS.1 through TS.6. However, we must be careful to allow for the fact that unobserved,
trending factors that affect $y_t$ might also be correlated with the explanatory variables. If we ignore this
possibility, we may find a spurious relationship between $y_t$ and one or more explanatory variables.
The phenomenon of finding a relationship between two or more trending variables simply because
each is growing over time is an example of a spurious regression problem. Fortunately, adding a
time trend eliminates this problem.

For concreteness, consider a model where two observed factors, $x_{t1}$ and $x_{t2}$, affect $y_t$. In addition,
there are unobserved factors that are systematically growing or shrinking over time. A model that
captures this is

$$y_t = \beta_0 + \beta_1 x_{t1} + \beta_2 x_{t2} + \beta_3 t + u_t. \tag{10.31}$$

This fits into the multiple linear regression framework with $x_{t3} = t$. Allowing for the trend in this
equation explicitly recognizes that $y_t$ may be growing ($\beta_3 > 0$) or shrinking ($\beta_3 < 0$) over time for
reasons essentially unrelated to $x_{t1}$ and $x_{t2}$. If (10.31) satisfies assumptions TS.1, TS.2, and TS.3, then
omitting t from the regression and regressing $y_t$ on $x_{t1}$, $x_{t2}$ will generally yield biased estimators of $\beta_1$
and $\beta_2$: we have effectively omitted an important variable, t, from the regression. This is especially
true if $x_{t1}$ and $x_{t2}$ are themselves trending, because they can then be highly correlated with t. The next
example shows how omitting a time trend can result in spurious regression.


EXAMPLE 10.7 Housing Investment and Prices


The data in HSEINV are annual observations on housing investment and a housing price index in the 
United States for 1947 through 1988. Let invpc denote real per capita housing investment (in thou- 
sands of dollars) and let price denote a housing price index (equal to 1 in 1982). A simple regression 
in constant elasticity form, which can be thought of as a supply equation for housing stock, gives

$$\widehat{\log(\mathit{invpc})} = \underset{(.043)}{-.550} + \underset{(.382)}{1.241}\,\log(\mathit{price}) \tag{10.32}$$
$$n = 42,\quad R^2 = .208,\quad \bar{R}^2 = .189.$$


The elasticity of per capita investment with respect to price is very large and statistically significant; 
it is not statistically different from one. We must be careful here. Both invpc and price have upward 
trends. In particular, if we regress log(invpc) on t, we obtain a coefficient on the trend equal to 
.0081 (standard error = .0018); the regression of log(price) on t yields a trend coefficient equal to 
.0044 (standard error = .0004). Although the standard errors on the trend coefficients are not
necessarily reliable (these regressions tend to contain substantial serial correlation), the coefficient
estimates do reveal upward trends.

To account for the trending behavior of the variables, we add a time trend:

$$\widehat{\log(\mathit{invpc})} = \underset{(1.36)}{-.913} - \underset{(.679)}{.381}\,\log(\mathit{price}) + \underset{(.0035)}{.0098}\,t \tag{10.33}$$
$$n = 42,\quad R^2 = .341,\quad \bar{R}^2 = .307.$$


The story is much different now: the estimated price elasticity is negative and not statistically dif- 
ferent from zero. The time trend is statistically significant, and its coefficient implies an approxi- 
mate 1% increase in invpc per year, on average. From this analysis, we cannot conclude that real 
per capita housing investment is influenced at all by price. There are other factors, captured in 
the time trend, that affect invpc, but we have not modeled these. The results in (10.32) show a 
spurious relationship between invpc and price due to the fact that price is also trending upward 
over time. 
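
The mechanism behind this spurious relationship can be reproduced with simulated data. The following is only a sketch: it does not use HSEINV, and the sample size, trend slopes, seed, and the helper `ols` are all hypothetical choices made for illustration. Two series share nothing except an upward linear trend; the slope on x is sizable when t is omitted and close to zero once t is included.

```python
import numpy as np

def ols(y, X):
    """Return OLS coefficients from regressing y on the columns of X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

rng = np.random.default_rng(0)          # hypothetical seed
n = 200
t = np.arange(1, n + 1, dtype=float)

# Assumed data-generating process: both series trend upward, otherwise unrelated.
x = 0.05 * t + rng.normal(size=n)
y = 0.03 * t + rng.normal(size=n)

const = np.ones(n)
b_no_trend = ols(y, np.column_stack([const, x]))     # y on 1, x
b_trend = ols(y, np.column_stack([const, x, t]))     # y on 1, x, t

print("slope on x, trend omitted: ", b_no_trend[1])
print("slope on x, trend included:", b_trend[1])
```

In this setup the true effect of x on y is zero by construction, so any large slope in the first regression is produced by the common trend alone.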


In some cases, adding a time trend can make a key explanatory variable more significant. This 
can happen if the dependent and independent variables have different kinds of trends (say, one upward 
and one downward), but movement in the independent variable about its trend line causes movement 
in the dependent variable away from its trend line. 


EXAMPLE 10.8 Fertility Equation 


If we add a linear time trend to the fertility equation (10.18), we obtain 


$\widehat{gfr}_t = 111.77 + .279\,pe_t - 35.59\,ww2_t + .997\,pill_t - 1.15\,t$
                   (3.36)   (.040)       (6.30)         (6.626)        (.19)       [10.34]
$n = 72$, $R^2 = .662$, $\bar{R}^2 = .642$.


The coefficient on pe is more than triple the estimate from (10.18), and it is much more statistically 
significant. Interestingly, pill is not significant once an allowance is made for a linear trend. As can be 
seen by the estimate, gfr was falling, on average, over this period, other factors being equal. 

Because the general fertility rate exhibited both upward and downward trends during the period from 
1913 through 1984, we can see how robust the estimated effect of pe is when we use a quadratic trend: 


$\widehat{gfr}_t = 124.09 + .348\,pe_t - 35.88\,ww2_t - 10.12\,pill_t - 2.53\,t + .0196\,t^2$
                   (4.36)   (.040)       (5.71)         (6.34)          (.39)     (.0050)    [10.35]
$n = 72$, $R^2 = .727$, $\bar{R}^2 = .706$.


The coefficient on pe is even larger and more statistically significant. Now, pill has the expected 
negative effect and is marginally significant, and both trend terms are statistically significant. The 
quadratic trend is a flexible way to account for the unusual trending behavior of gfr. 


You might be wondering in Example 10.8: why stop at a quadratic trend? Nothing prevents us
from adding, say, $t^3$ as an independent variable, and, in fact, this might be warranted (see Computer
Exercise C6). But we have to be careful not to get carried away when including trend terms in a
model. We want relatively simple trends that capture broad movements in the dependent variable that
are not explained by the independent variables in the model. If we include enough polynomial terms
in $t$, then we can track any series pretty well. But this offers little help in finding which explanatory
variables affect $y_t$.
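
A small simulation illustrates why piling on polynomial trend terms is uninformative. This is a sketch under assumed conditions: the series below is a hypothetical slow cycle plus noise, there are no explanatory variables at all, and the time index is rescaled to [-1, 1] purely for numerical stability. The R-squared still climbs as the degree of the trend polynomial rises.

```python
import numpy as np

rng = np.random.default_rng(1)          # hypothetical seed
n = 72
s = np.linspace(-1.0, 1.0, n)           # rescaled time index (an assumption for stability)

# Assumed series: a slow cycle plus noise, with no explanatory variables involved.
y = np.sin(3.0 * s) + 0.3 * rng.normal(size=n)

def r_squared(y, X):
    """Usual R-squared from an OLS regression of y on X (X contains a constant)."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)

# Regress y on 1, s, s^2, ..., s^p for p = 1, ..., 6.
degrees = range(1, 7)
r2_by_degree = [r_squared(y, np.vander(s, p + 1, increasing=True)) for p in degrees]
for p, r2 in zip(degrees, r2_by_degree):
    print(f"degree {p}: R-squared = {r2:.3f}")
```

The fit improves with every added term because the regressions are nested, not because anything has been learned about what drives the series.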
10-5c A Detrending Interpretation of Regressions with a Time Trend


Including a time trend in a regression model creates a nice interpretation in terms of detrending the 
original data series before using them in regression analysis. For concreteness, we focus on model 
(10.31), but our conclusions are much more general. 

When we regress $y_t$ on $x_{t1}$, $x_{t2}$, and $t$, we obtain the fitted equation


$\hat{y}_t = \hat{\beta}_0 + \hat{\beta}_1 x_{t1} + \hat{\beta}_2 x_{t2} + \hat{\beta}_3 t.$ [10.36]


We can extend the Frisch-Waugh result on the partialling out interpretation of OLS that we covered in
Section 3-2 to show that $\hat{\beta}_1$ and $\hat{\beta}_2$ can be obtained as follows.

(i) Regress each of $y_t$, $x_{t1}$, and $x_{t2}$ on a constant and the time trend $t$ and save the residuals, say,
$\ddot{y}_t$, $\ddot{x}_{t1}$, $\ddot{x}_{t2}$, $t = 1, 2, \ldots, n$. For example,


$\ddot{y}_t = y_t - \hat{\alpha}_0 - \hat{\alpha}_1 t.$

Thus, we can think of $\ddot{y}_t$ as being linearly detrended. In detrending $y_t$, we have estimated the model

$y_t = \alpha_0 + \alpha_1 t + e_t$

by OLS; the residuals from this regression, $\hat{e}_t = \ddot{y}_t$, have the time trend removed (at least in the
sample). A similar interpretation holds for $\ddot{x}_{t1}$ and $\ddot{x}_{t2}$.

(ii) Run the regression of


$\ddot{y}_t$ on $\ddot{x}_{t1}$, $\ddot{x}_{t2}$. [10.37]


(No intercept is necessary, but including an intercept affects nothing: the intercept will be estimated to
be zero.) This regression exactly yields $\hat{\beta}_1$ and $\hat{\beta}_2$ from (10.36).

This means that the estimates of primary interest, $\hat{\beta}_1$ and $\hat{\beta}_2$, can be interpreted as coming from a
regression without a time trend, but where we first detrend the dependent variable and all other
independent variables. The same conclusion holds with any number of independent variables and if the
trend is quadratic or of some other polynomial degree.
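
The equivalence between the one-step and two-step estimates can be checked numerically. The sketch below uses simulated data; the coefficients, seed, and helper names (`ols`, `detrend`) are hypothetical and chosen only for illustration.

```python
import numpy as np

def ols(y, X):
    """Return OLS coefficients from regressing y on the columns of X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def detrend(v, t):
    """Residuals from regressing v on a constant and the time trend t."""
    Z = np.column_stack([np.ones_like(t), t])
    return v - Z @ ols(v, Z)

rng = np.random.default_rng(2)          # hypothetical seed
n = 60
t = np.arange(1, n + 1, dtype=float)

# Assumed data-generating process with trending regressors.
x1 = 0.10 * t + rng.normal(size=n)
x2 = -0.05 * t + rng.normal(size=n)
y = 1.0 + 0.5 * x1 - 0.8 * x2 + 0.2 * t + rng.normal(size=n)

# One step, as in (10.36): y on 1, x1, x2, t.
b_full = ols(y, np.column_stack([np.ones(n), x1, x2, t]))

# Two steps: detrend y, x1, x2, then regress residuals on residuals as in (10.37).
b_fwl = ols(detrend(y, t), np.column_stack([detrend(x1, t), detrend(x2, t)]))

print("slopes from (10.36):", b_full[1:3])
print("slopes from (10.37):", b_fwl)
```

The two sets of slope estimates agree up to floating-point error, whatever values are chosen for the simulated coefficients.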

If $t$ is omitted from (10.36), then no detrending occurs, and $y_t$ might seem to be related to one or
more of the $x_{tj}$ simply because each contains a trend; we saw this in Example 10.7. If the trend term
is statistically significant, and the results change in important ways when a time trend is added to a
regression, then the initial results without a trend should be treated with suspicion.

The interpretation of $\hat{\beta}_1$ and $\hat{\beta}_2$ shows that it is a good idea to include a trend in the regression
if any independent variable is trending, even if $y_t$ is not. If $y_t$ has no noticeable trend, but, say, $x_{t1}$ is
growing over time, then excluding a trend from the regression may make it look as if $x_{t1}$ has no effect
on $y_t$, even though movements of $x_{t1}$ about its trend may affect $y_t$. This will be captured if $t$ is included
in the regression.
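
This second case, a trending regressor with an untrended dependent variable, can also be sketched with simulated data. Everything here is an assumption made for illustration: the trend slope of 0.5, the unit effect of deviations from trend, the noise scale, and the seed.

```python
import numpy as np

def ols(y, X):
    """Return OLS coefficients from regressing y on the columns of X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

rng = np.random.default_rng(6)          # hypothetical seed
n = 100
t = np.arange(1, n + 1, dtype=float)

e = rng.normal(size=n)                  # movements of x about its trend
x = 0.5 * t + e                         # x grows over time
# Assumed DGP: y has no trend and responds one-for-one to deviations of x from trend.
y = 2.0 + 1.0 * e + rng.normal(scale=0.5, size=n)

const = np.ones(n)
b_no_trend = ols(y, np.column_stack([const, x]))     # y on 1, x
b_trend = ols(y, np.column_stack([const, x, t]))     # y on 1, x, t

print("slope on x, trend omitted: ", b_no_trend[1])
print("slope on x, trend included:", b_trend[1])
```

Without the trend, the slope on x is pulled toward zero because most of the variation in x is trend that y does not share; with the trend, the estimate is close to the assumed value of one.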


EXAMPLE 10.9 Puerto Rican Employment 


When we add a linear trend to equation (10.17), the estimates are 


$\widehat{\log(prepop_t)} = -8.70 - .169\,\log(mincov_t) + 1.06\,\log(usgnp_t) - .032\,t$
                            (1.30)  (.044)                 (0.18)                (.005)      [10.38]
$n = 38$, $R^2 = .847$, $\bar{R}^2 = .834$.


The coefficient on log(usgnp) has changed dramatically, from -.012 and insignificant to 1.06 and
very significant. The coefficient on the minimum wage has changed only slightly, although the standard
error is notably smaller, making log(mincov) more significant than before.

The variable $prepop_t$ displays no clear upward or downward trend, but log(usgnp) has an upward,
linear trend. [A regression of log(usgnp) on t gives an estimate of about .03, so that usgnp is grow- 
ing by about 3% per year over the period.] We can think of the estimate 1.06 as follows: when usgnp 
increases by 1% above its long-run trend, prepop increases by about 1.06%. 


10-5d Computing R-Squared When the Dependent Variable Is Trending


R-squareds in time series regressions are often very high, especially compared with typical R-squareds 
for cross-sectional data. Does this mean that we learn more about factors affecting y from time series 
data? Not necessarily. On one hand, time series data often come in aggregate form (such as average 
hourly wages in the U.S. economy), and aggregates are often easier to explain than outcomes on indi- 
viduals, families, or firms, which is often the nature of cross-sectional data. But the usual and adjusted 
R-squareds for time series regressions can be artificially high when the dependent variable is trending.
Remember that $R^2$ is a measure of how large the error variance is relative to the variance of $y$.
The formula for the adjusted R-squared shows this directly:

$\bar{R}^2 = 1 - (\hat{\sigma}_u^2/\hat{\sigma}_y^2),$

where $\hat{\sigma}_u^2$ is the unbiased estimator of the error variance, $\hat{\sigma}_y^2 = SST/(n - 1)$, and
$SST = \sum_{t=1}^{n}(y_t - \bar{y})^2$. Now, estimating the error variance when $y_t$ is trending is no problem,
provided a time trend is included in the regression. However, when $E(y_t)$ follows, say, a linear time
trend [see (10.24)], $SST/(n - 1)$ is no longer an unbiased or consistent estimator of $Var(y_t)$. In fact,
$SST/(n - 1)$ can substantially overestimate the variance in $y_t$, because it does not account for the
trend in $y_t$.

When the dependent variable satisfies linear, quadratic, or any other polynomial trends, it is easy
to compute a goodness-of-fit measure that first nets out the effect of any time trend on $y_t$. The simplest
method is to compute the usual R-squared in a regression where the dependent variable has already
been detrended. For example, if the model is (10.31), then we first regress $y_t$ on $t$ and obtain the
residuals $\ddot{y}_t$. Then, we regress


$\ddot{y}_t$ on $x_{t1}$, $x_{t2}$, and $t$. [10.39]

The R-squared from this regression is


$1 - \dfrac{SSR}{\sum_{t=1}^{n}\ddot{y}_t^2},$ [10.40]


where SSR is identical to the sum of squared residuals from (10.36). Because
$\sum_{t=1}^{n}\ddot{y}_t^2 \le \sum_{t=1}^{n}(y_t - \bar{y})^2$ (and usually the inequality is strict), the R-squared from (10.40)
is no greater than, and usually less than, the R-squared from (10.36). (The sum of squared residuals is
identical in both regressions.) When $y_t$ contains a strong linear time trend, (10.40) can be much less
than the usual R-squared.

The R-squared in (10.40) better reflects how well $x_{t1}$ and $x_{t2}$ explain $y_t$ because it nets out the
effect of the time trend. After all, we can always explain a trending variable with some sort of trend,
but this does not mean we have uncovered any factors that cause movements in $y_t$. An adjusted
R-squared can also be computed based on (10.40): divide SSR by $(n - 4)$ because this is the df in
(10.36) and divide $\sum_{t=1}^{n}\ddot{y}_t^2$ by $(n - 2)$, as there are two trend parameters estimated in detrending $y_t$.
In general, SSR is divided by the df in the usual regression (that includes any time trends),
and $\sum_{t=1}^{n}\ddot{y}_t^2$ is divided by $(n - p)$, where $p$ is the number of trend parameters estimated in detrending $y_t$.
Wooldridge (1991a) provides detailed suggestions for degrees-of-freedom corrections, but a
computationally simple approach is fine as an approximation: use the adjusted R-squared from the
regression of $\ddot{y}_t$ on $t, t^2, \ldots, t^p, x_{t1}, \ldots, x_{tk}$. This requires us only to remove the trend from $y_t$ to obtain $\ddot{y}_t$, and
then we can use $\ddot{y}_t$ to compute the usual kinds of goodness-of-fit measures.
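
The calculation in (10.40) and its adjusted version can be sketched as follows. The data are simulated, and the coefficients, seed, and helper `fit` are hypothetical; only the formulas for the two R-squareds follow the text above.

```python
import numpy as np

def fit(y, X):
    """Return (coefficients, residuals) from an OLS regression of y on X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef, y - X @ coef

rng = np.random.default_rng(3)          # hypothetical seed
n = 42
t = np.arange(1, n + 1, dtype=float)

# Assumed DGP: y has a strong linear trend and only weak dependence on x1, x2.
x1 = 0.02 * t + rng.normal(scale=0.2, size=n)
x2 = rng.normal(size=n)
y = 0.5 + 0.1 * x1 + 0.1 * x2 + 0.05 * t + rng.normal(scale=0.3, size=n)

X = np.column_stack([np.ones(n), x1, x2, t])
_, resid = fit(y, X)
ssr = resid @ resid                     # SSR from the regression that includes t

# Usual R-squared: total variation measured around the sample mean of y.
sst = np.sum((y - y.mean()) ** 2)
r2_usual = 1.0 - ssr / sst

# R-squared as in (10.40): total variation measured around the fitted trend.
_, y_dd = fit(y, np.column_stack([np.ones(n), t]))   # linearly detrended y
sst_detrended = y_dd @ y_dd
r2_detrended = 1.0 - ssr / sst_detrended

# Adjusted version: SSR / (n - 4) and detrended total variation / (n - 2).
r2_detrended_adj = 1.0 - (ssr / (n - 4)) / (sst_detrended / (n - 2))

print(f"usual R-squared:              {r2_usual:.3f}")
print(f"detrended R-squared (10.40):  {r2_detrended:.3f}")
print(f"adjusted detrended R-squared: {r2_detrended_adj:.3f}")
```

Because the same SSR enters both measures, the whole difference between them comes from the denominator: the detrended total sum of squares is never larger than the usual SST.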


EXAMPLE 10.10 Housing Investment


In Example 10.7, we saw that including a linear time trend along with log(price) in the housing 
investment equation had a substantial effect on the price elasticity. But the R-squared from regres- 
sion (10.33), taken literally, says that we are “explaining” 34.1% of the variation in log(invpc). This 
is misleading. If we first detrend log(invpc) and regress the detrended variable on log(price) and t, 
the R-squared becomes .008, and the adjusted R-squared is actually negative. Thus, movements in 
log(price) about its trend have virtually no explanatory power for movements in log(invpc) about its 
trend. This is consistent with the fact that the t statistic on log(price) in equation (10.33) is very small.


Before leaving this subsection, we must make a final point. In computing the R-squared form of 
an F statistic for testing multiple hypotheses, we just use the usual R-squareds without any detrend- 
ing. Remember, the R-squared form of the F statistic is just a computational device, and so the usual 
formula is always appropriate. 
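
A short derivation shows why no detrending is needed here. It uses the standard notation for testing $q$ exclusion restrictions, with $SSR_r$ and $SSR_{ur}$ the restricted and unrestricted sums of squared residuals and $n - k - 1$ the unrestricted degrees of freedom; these symbols are assumed from the usual treatment of the F statistic rather than defined in this subsection. Since both regressions have the same dependent variable, $SSR = SST(1 - R^2)$ holds in each with the same $SST$:

```latex
F = \frac{(SSR_r - SSR_{ur})/q}{SSR_{ur}/(n-k-1)}
  = \frac{\bigl[SST(1-R_r^2) - SST(1-R_{ur}^2)\bigr]/q}{SST(1-R_{ur}^2)/(n-k-1)}
  = \frac{(R_{ur}^2 - R_r^2)/q}{(1-R_{ur}^2)/(n-k-1)}
```

The total sum of squares cancels, so the statistic depends only on the two sums of squared residuals, and how well $SST/(n-1)$ estimates the variance of $y_t$ is irrelevant for the test.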


10-5e Seasonality 


If a time series is observed at monthly or quarterly intervals (or even weekly or daily), it may exhibit 
seasonality. For example, monthly housing starts in the Midwest are strongly influenced by weather. 
Although weather patterns are somewhat random, we can be sure that the weather during January will 
usually be more inclement than in June, and so housing starts are generally higher in June than in January. 
One way to model this phenomenon is to allow the expected value of the series, y,, to be different in each 
month. As another example, retail sales in the fourth quarter are typically higher than in the previous three 
quarters because of the Christmas holiday. Again, this can be captured by allowing the average retail 
sales to differ over the course of a year. This is in addition to possibly allowing for a trending mean. For 
example, retail sales in the most recent first quarter were higher than retail sales in the fourth quarter from 
30 years ago, because retail sales have been steadily growing. Nevertheless, if we compare average sales 
within a typical year, the seasonal holiday factor tends to make sales larger in the fourth quarter. 

Even though many monthly and quarterly data series display seasonal patterns, not all of them 
do. For example, there is no noticeable seasonal pattern in monthly interest or inflation rates. In addi- 
tion, series that do display seasonal patterns are often seasonally adjusted before they are reported 
for public use. A seasonally adjusted series is one that, in principle, has had the seasonal factors 
removed from it. Seasonal adjustment can be done in a variety of ways, and a careful discussion is 
beyond the scope of this text. [See Harvey (1990) and Hylleberg (1992) for detailed treatments.]

Seasonal adjustment has become so common that it is not possible to get seasonally unadjusted 
data in many cases. Quarterly U.S. GDP is a leading example. In the annual Economic Report of 
the President, many macroeconomic data sets reported at monthly frequencies (at least for the most 
recent years) and those that display seasonal patterns are all seasonally adjusted. The major sources 
for macroeconomic time series, including Citibase, also seasonally adjust many of the series. Thus, 
the scope for using our own seasonal adjustment is often limited. 

Sometimes, we do work with seasonally unadjusted data, and it is useful to know that simple
methods are available for dealing with seasonality in regression models. Generally, we can include a
set of seasonal dummy variables to account for seasonality in the dependent variable, the independent
variables, or both.

The approach is simple. Suppose that we have monthly data, and we think that seasonal patterns 
within a year are roughly constant across time. For example, because Christmas always comes at the 
same time of year, we can expect retail sales to be, on average, higher in months late in the year than 
in earlier months. Or, because weather patterns are broadly similar across years, housing starts in 
the Midwest will be higher on average during the summer months than the winter months. A general 
model for monthly data that captures these phenomena is 


$y_t = \beta_0 + \delta_1 feb_t + \delta_2 mar_t + \delta_3 apr_t + \cdots + \delta_{11} dec_t + \beta_1 x_{t1} + \cdots + \beta_k x_{tk} + u_t,$ [10.41]


where $feb_t, mar_t, \ldots, dec_t$ are dummy variables indicating whether time period $t$ corresponds to the
appropriate month. In this formulation, January is the base month, and $\beta_0$ is the intercept for January.
If there is no seasonality in $y_t$ once the $x_{tj}$ have been controlled for, then $\delta_1$ through $\delta_{11}$ are all zero. This is
easily tested via an F test.


GOING FURTHER 10.5

In equation (10.41), what is the intercept for March? Explain why seasonal dummy variables satisfy the
strict exogeneity assumption.
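
The joint test that the eleven monthly coefficients in (10.41) are zero can be sketched with simulated monthly data. The series, the December effect of 3.0, the seed, and the helper `ssr` are hypothetical; the F statistic is the usual sum-of-squared-residuals form, and it would be compared with a critical value from the F distribution with the printed degrees of freedom.

```python
import numpy as np

def ssr(y, X):
    """Sum of squared residuals from an OLS regression of y on X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return resid @ resid

rng = np.random.default_rng(4)          # hypothetical seed
n_years = 10
n = 12 * n_years
month = np.tile(np.arange(12), n_years) # 0 = January, ..., 11 = December

# Eleven dummies feb, ..., dec; January is the base month, as in (10.41).
dummies = (month[:, None] == np.arange(1, 12)).astype(float)

x = rng.normal(size=n)
# Assumed DGP: a December effect of 3.0 and no other seasonal effects.
y = 1.0 + 0.5 * x + 3.0 * (month == 11) + rng.normal(size=n)

X_r = np.column_stack([np.ones(n), x])  # restricted model: no seasonal dummies
X_ur = np.column_stack([X_r, dummies])  # unrestricted model: adds the 11 dummies

q = dummies.shape[1]                    # number of restrictions
df_ur = n - X_ur.shape[1]               # degrees of freedom, unrestricted model
F = ((ssr(y, X_r) - ssr(y, X_ur)) / q) / (ssr(y, X_ur) / df_ur)
print(f"F({q}, {df_ur}) = {F:.2f}")
```

With quarterly data the same construction applies with three dummies in place of eleven.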


EXAMPLE 10.11 Effects of Antidumping Filings


In Example 10.5, we used monthly data (in the file BARIUM) that have not been seasonally adjusted. 
Therefore, we should add seasonal dummy variables to make sure none of the important conclusions 
change. It could be that the months just before the suit was filed are months where imports are higher 
or lower, on average, than in other months. When we add the 11 monthly dummy variables as in 
(10.41) and test their joint significance, we obtain p-value = .59, and so the seasonal dummies are 
jointly insignificant. In addition, nothing important changes in the estimates once statistical signifi- 
cance is taken into account. Krupp and Pollard (1996) actually used three dummy variables for the 
seasons (fall, spring, and summer, with winter as the base season), rather than a full set of monthly 
dummies; the outcome is essentially the same. 


If the data are quarterly, then we would include dummy variables for three of the four quarters,
with the omitted category being the base quarter. Sometimes, it is useful to interact seasonal dummies
with some of the $x_{tj}$ to allow the effect of $x_{tj}$ on $y_t$ to differ across the year.

Just as including a time trend in a regression has the interpretation of initially detrending the 
data, including seasonal dummies in a regression can be interpreted as deseasonalizing the data. For 
concreteness, consider equation (10.41) with $k = 2$. The OLS slope coefficients $\hat{\beta}_1$ and $\hat{\beta}_2$ on $x_1$ and
$x_2$ can be obtained as follows:


(i) Regress each of $y_t$, $x_{t1}$, and $x_{t2}$ on a constant and the monthly dummies, $feb_t, mar_t, \ldots, dec_t$,
and save the residuals, say, $\ddot{y}_t$, $\ddot{x}_{t1}$, and $\ddot{x}_{t2}$, for all $t = 1, 2, \ldots, n$. For example,

$\ddot{y}_t = y_t - \hat{\alpha}_0 - \hat{\alpha}_1 feb_t - \hat{\alpha}_2 mar_t - \cdots - \hat{\alpha}_{11} dec_t.$


This is one method of deseasonalizing a monthly time series. A similar interpretation holds for $\ddot{x}_{t1}$
and $\ddot{x}_{t2}$.

(ii) Run the regression, without the monthly dummies, of $\ddot{y}_t$ on $\ddot{x}_{t1}$ and $\ddot{x}_{t2}$ [just as in (10.37)]. This
gives $\hat{\beta}_1$ and $\hat{\beta}_2$.
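
Step (i) has a simple reading that a short sketch makes concrete: because the constant and eleven monthly dummies form a saturated set, the residuals are each observation minus the sample mean for its own month. The simulated series, its seasonal pattern, and the seed below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(5)          # hypothetical seed
n_years = 8
n = 12 * n_years
month = np.tile(np.arange(12), n_years) # 0 = January, ..., 11 = December

# Assumed seasonal series: a smooth within-year pattern plus noise.
y = 10.0 + np.sin(2 * np.pi * month / 12) + rng.normal(size=n)

# Step (i): residuals from regressing y on a constant and the 11 monthly dummies.
D = np.column_stack([np.ones(n), (month[:, None] == np.arange(1, 12)).astype(float)])
coef, *_ = np.linalg.lstsq(D, y, rcond=None)
y_dd = y - D @ coef

# Equivalent shortcut: subtract from each observation the mean of its own month.
month_means = np.array([y[month == m].mean() for m in range(12)])
y_demeaned = y - month_means[month]

print("max abs difference:", np.max(np.abs(y_dd - y_demeaned)))
```

The same shortcut applies to each explanatory variable, so step (ii) is a regression of month-demeaned y on month-demeaned regressors.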

In some cases, if $y_t$ has pronounced seasonality, a better goodness-of-fit measure is an R-squared
based on the deseasonalized $y_t$. This nets out any seasonal effects that are not explained by the $x_{tj}$.
Wooldridge (1991a) suggests specific degrees-of-freedom adjustments, or one may simply use the
adjusted R-squared where the dependent variable has been deseasonalized.

Time series exhibiting seasonal patterns can be trending as well, in which case we should estimate
a regression model with a time trend and seasonal dummy variables. The regressions can then be
interpreted as regressions using both detrended and deseasonalized series. Goodness-of-fit statistics
are discussed in Wooldridge (1991a): essentially, we detrend and deseasonalize $y_t$ by regressing on
both a time trend and seasonal dummies before computing R-squared or adjusted R-squared.


Summary 


In this chapter, we have covered basic regression analysis with time series data. Under assumptions that 
parallel those for cross-sectional analysis, OLS is unbiased (under TS.1 through TS.3), OLS is BLUE 
(under TS.1 through TS.5), and the usual OLS standard errors, t statistics, and F statistics can be used for 
statistical inference (under TS.1 through TS.6). Because of the temporal correlation in most time series 
data, we must explicitly make assumptions about how the errors are related to the explanatory variables 
in all time periods and about the temporal correlation in the errors themselves. The classical linear model 
assumptions can be pretty restrictive for time series applications, but they are a natural starting point. We 
have applied them to both static regression and finite distributed lag models. 

Logarithms and dummy variables are used regularly in time series applications and in event studies. 
We also discussed index numbers and time series measured in terms of nominal and real dollars. 

Trends and seasonality can be easily handled in a multiple regression framework by including time and 
seasonal dummy variables in our regression equations. We presented problems with the usual R-squared as 
a goodness-of-fit measure and suggested some simple alternatives based on detrending or deseasonalizing. 


CLASSICAL LINEAR MODEL ASSUMPTIONS FOR TIME SERIES REGRESSION 


Following is a summary of the six classical linear model (CLM) assumptions for time series regression 
applications. Assumptions TS.1 through TS.5 are the time series versions of the Gauss-Markov assump- 
tions (which implies that OLS is BLUE and has the usual sampling variances). We only needed TS.1, TS.2, 
and TS.3 to establish unbiasedness of OLS. As in the case of cross-sectional regression, the normality 
assumption, TS.6, was used so that we could perform exact statistical inference for any sample size. 


Assumption TS.1 (Linear in Parameters) 


The stochastic process $\{(x_{t1}, x_{t2}, \ldots, x_{tk}, y_t): t = 1, 2, \ldots, n\}$ follows the linear model
$y_t = \beta_0 + \beta_1 x_{t1} + \beta_2 x_{t2} + \cdots + \beta_k x_{tk} + u_t,$
where $\{u_t: t = 1, 2, \ldots, n\}$ is the sequence of errors or disturbances. Here, $n$ is the number of observations
(time periods).


Assumption TS.2 (No Perfect Collinearity) 
In the sample (and therefore in the underlying time series process), no independent variable is constant nor 
a perfect linear combination of the others. 


Assumption TS.3 (Zero Conditional Mean) 
For each $t$, the expected value of the error $u_t$, given the explanatory variables for all time periods, is zero.
Mathematically, $E(u_t|X) = 0$, $t = 1, 2, \ldots, n$.

Assumption TS.3 replaces MLR.4 for cross-sectional regression, and it also means we do not have to
make the random sampling assumption MLR.2. Remember, Assumption TS.3 implies that the error in each
time period $t$ is uncorrelated with all explanatory variables in all time periods (including, of course, time
period $t$).


Assumption TS.4 (Homoskedasticity)
Conditional on X, the variance of $u_t$ is the same for all $t$: $Var(u_t|X) = Var(u_t) = \sigma^2$, $t = 1, 2, \ldots, n$.


Assumption TS.5 (No Serial Correlation) 

Conditional on X, the errors in two different time periods are uncorrelated: $Corr(u_t, u_s|X) = 0$, for all $t \neq s$.

Recall that we added the no serial correlation assumption, along with the homoskedasticity assumption,
to obtain the same variance formulas that we derived for cross-sectional regression under random
sampling. As we will see in Chapter 12, Assumption TS.5 is often violated in ways that can make the usual
statistical inference very unreliable.


Assumption TS.6 (Normality) 
The errors $u_t$ are independent of X and are independently and identically distributed as Normal$(0, \sigma^2)$.


Key Terms 


Autocorrelation Growth Rate Seasonally Adjusted 
Base Period Impact Multiplier Serial Correlation 
Base Value Impact Propensity Short-Run Elasticity 
Contemporaneously Exogenous Index Number Spurious Regression Problem 
Cumulative Effect Lag Distribution Static Model 
Deseasonalizing Linear Time Trend Stochastic Process 
Detrending Long-Run Elasticity Strictly Exogenous 
Event Study Long-Run Multiplier Time Series Process 
Exponential Trend Long-Run Propensity (LRP) Time Trend 
Finite Distributed Lag (FDL) Model Seasonal Dummy Variables
Seasonality


Problems 


1 Decide if you agree or disagree with each of the following statements and give a brief explanation of
your decision:

(i) Like cross-sectional observations, we can assume that most time series observations are 
independently distributed. 

(ii) The OLS estimator in a time series regression is unbiased under the first three Gauss-Markov 
assumptions. 

(iii) A trending variable cannot be used as the dependent variable in multiple regression analysis. 

(iv) Seasonality is not an issue when using annual time series observations. 


2 Let $gGDP_t$ denote the annual percentage change in gross domestic product and let $int_t$ denote a
short-term interest rate. Suppose that $gGDP_t$ is related to interest rates by

$gGDP_t = \alpha_0 + \delta_0 int_t + \delta_1 int_{t-1} + u_t,$

where $u_t$ is uncorrelated with $int_t$, $int_{t-1}$, and all other past values of interest rates. Suppose that the
Federal Reserve follows the policy rule:

$int_t = \gamma_0 + \gamma_1 (gGDP_{t-1} - 3) + v_t,$

where $\gamma_1 > 0$. (When last year’s GDP growth is above 3%, the Fed increases interest rates to prevent
an “overheated” economy.) If $v_t$ is uncorrelated with all past values of $int_t$ and $u_t$, argue that $int_t$ must be
correlated with $u_{t-1}$. (Hint: Lag the first equation for one time period and substitute for $gGDP_{t-1}$ in the
second equation.) Which Gauss-Markov assumption does this violate?


3 Suppose $y_t$ follows a second order FDL model:

$y_t = \alpha_0 + \delta_0 z_t + \delta_1 z_{t-1} + \delta_2 z_{t-2} + u_t.$

Let $z^*$ denote the equilibrium value of $z_t$ and let $y^*$ be the equilibrium value of $y_t$, such that

$y^* = \alpha_0 + \delta_0 z^* + \delta_1 z^* + \delta_2 z^*.$

Show that the change in $y^*$, due to a change in $z^*$, equals the long-run propensity times the change in $z^*$:

$\Delta y^* = LRP \cdot \Delta z^*.$

This gives an alternative way of interpreting the LRP.


4 When the three event indicators befile6, affile6, and afdec6 are dropped from equation (10.22), we
obtain $R^2 = .281$ and $\bar{R}^2 = .264$. Are the event indicators jointly significant at the 10% level?


5 Suppose you have quarterly data on new housing starts, interest rates, and real per capita income.
Specify a model for housing starts that accounts for possible trends and seasonality in the variables.


6 In Example 10.4, we saw that our estimates of the individual lag coefficients in a distributed lag model
were very imprecise. One way to alleviate the multicollinearity problem is to assume that the $\delta_j$ follow
a relatively simple pattern. For concreteness, consider a model with four lags:

$y_t = \alpha_0 + \delta_0 z_t + \delta_1 z_{t-1} + \delta_2 z_{t-2} + \delta_3 z_{t-3} + \delta_4 z_{t-4} + u_t.$

Now, let us assume that the $\delta_j$ follow a quadratic in the lag, $j$:

$\delta_j = \gamma_0 + \gamma_1 j + \gamma_2 j^2,$

for parameters $\gamma_0$, $\gamma_1$, and $\gamma_2$. This is an example of a polynomial distributed lag (PDL) model.

(i) Plug the formula for each $\delta_j$ into the distributed lag model and write the model in terms of the
parameters $\gamma_h$, for $h = 0, 1, 2$.

(ii) Explain the regression you would run to estimate the $\gamma_h$.

(iii) The polynomial distributed lag model is a restricted version of the general model. How many
restrictions are imposed? How would you test these? (Hint: Think F test.)


7 In Example 10.4, we wrote the model that explicitly contains the long-run propensity, $\theta_0$, as

$gfr_t = \alpha_0 + \theta_0 pe_t + \delta_1 (pe_{t-1} - pe_t) + \delta_2 (pe_{t-2} - pe_t) + u_t,$

where we omit the other explanatory variables for simplicity. As always with multiple regression
analysis, $\theta_0$ should have a ceteris paribus interpretation. Namely, if $pe_t$ increases by one (dollar) holding
$(pe_{t-1} - pe_t)$ and $(pe_{t-2} - pe_t)$ fixed, $gfr_t$ should change by $\theta_0$.

(i) If $(pe_{t-1} - pe_t)$ and $(pe_{t-2} - pe_t)$ are held fixed but $pe_t$ is increasing, what must be true about
changes in $pe_{t-1}$ and $pe_{t-2}$?

(ii) How does your answer in part (i) help you to interpret $\theta_0$ in the above equation as the LRP?


8 In the linear model given in equation (10.8), the explanatory variables $x_t = (x_{t1}, \ldots, x_{tk})$ are said to be
sequentially exogenous (sometimes called weakly exogenous) if

$E(u_t | x_t, x_{t-1}, \ldots, x_1) = 0$, $t = 1, 2, \ldots,$

so that the errors are unpredictable given current and all past values of the explanatory variables.
(i) Explain why sequential exogeneity is implied by strict exogeneity.
(ii) Explain why contemporaneous exogeneity is implied by sequential exogeneity.
(iii) Are the OLS estimators generally unbiased under the sequential exogeneity assumption? Explain.
(iv) Consider a model to explain the annual rate of HIV infections (HIVrate) as a distributed lag of
per capita condom usage (pccon) for a state, region, or province:

$E(HIVrate_t | pccon_t, pccon_{t-1}, \ldots) = \alpha_0 + \delta_0 pccon_t + \delta_1 pccon_{t-1} + \delta_2 pccon_{t-2} + \delta_3 pccon_{t-3}.$

Explain why this model satisfies the sequential exogeneity assumption. Does it seem likely that
strict exogeneity holds too?


Computer Exercises 


C1 In October 1979, the Federal Reserve changed its policy of using finely tuned interest rate adjustments
and instead began targeting the money supply. Using the data in INTDEF, define a dummy variable
equal to 1 for years after 1979. Include this dummy in equation (10.15) to see if there is a shift in the
interest rate equation after 1979. What do you conclude?


C2 Use the data in BARIUM for this exercise.

(i) Add a linear time trend to equation (10.22). Are any variables, other than the trend, statistically
significant?

(ii) In the equation estimated in part (i), test for joint significance of all variables except the time
trend. What do you conclude?

(iii) Add monthly dummy variables to this equation and test for seasonality. Does including the
monthly dummies change any other estimates or their standard errors in important ways?


C3 Add the variable log(prgnp) to the minimum wage equation in (10.38). Is this variable significant?
Interpret the coefficient. How does adding log(prgnp) affect the estimated minimum wage effect?


C4 Use the data in FERTIL3 to verify that the standard error for the LRP in equation (10.19) is about .030.


C5 Use the data in EZANDERS for this exercise. The data are on monthly unemployment claims in
Anderson Township in Indiana, from January 1980 through November 1988. In 1984, an enterprise
zone (EZ) was located in Anderson (as well as other cities in Indiana). [See Papke (1994) for details.]

(i) Regress log(uclms) on a linear time trend and 11 monthly dummy variables. What was the
overall trend in unemployment claims over this period? (Interpret the coefficient on the time
trend.) Is there evidence of seasonality in unemployment claims?

(ii) Add ez, a dummy variable equal to one in the months Anderson had an EZ, to the regression
in part (i). Does having the enterprise zone seem to decrease unemployment claims? By how
much? [You should use formula (7.10) from Chapter 7.]

(iii) What assumptions do you need to make to attribute the effect in part (ii) to the creation of an EZ?


C6 Use the data in FERTIL3 for this exercise.

(i) Regress $gfr_t$ on $t$ and $t^2$ and save the residuals. This gives a detrended $gfr_t$, say, $\ddot{gf}_t$.

(ii) Regress $\ddot{gf}_t$ on all of the variables in equation (10.35), including $t$ and $t^2$. Compare the
R-squared with that from (10.35). What do you conclude?

(iii) Reestimate equation (10.35) but add $t^3$ to the equation. Is this additional term statistically
significant?


C7 Use the data set CONSUMP for this exercise.

(i) Estimate a simple regression model relating the growth in real per capita consumption (of
nondurables and services) to the growth in real per capita disposable income. Use the change
in the logarithms in both cases. Report the results in the usual form. Interpret the equation and
discuss statistical significance.

(ii) Add a lag of the growth in real per capita disposable income to the equation from part (i). What
do you conclude about adjustment lags in consumption growth?

(iii) Add the real interest rate to the equation in part (i). Does it affect consumption growth?


C8 Use the data in FERTIL3 for this exercise.

(i) Add pe,_3 and pe,_4 to equation (10.19). Test for joint significance of these lags. 

(ii) Find the estimated long-run propensity and its standard error in the model from part (i). Compare 
these with those obtained from equation (10.19). 

(iii) Estimate the polynomial distributed lag model from Problem 6. Find the estimated LRP and 
compare this with what is obtained from the unrestricted model. 


C9 Use the data in VOLAT for this exercise. The variable rsp500 is the monthly return on the Standard &
Poor’s 500 stock market index, at an annual rate. (This includes price changes as well as dividends.)
The variable i3 is the return on three-month T-bills, and pcip is the percentage change in industrial
production; these are also at an annual rate.

(i) Consider the equation

$rsp500_t = \beta_0 + \beta_1 pcip_t + \beta_2 i3_t + u_t.$

What signs do you think $\beta_1$ and $\beta_2$ should have?

(ii) Estimate the previous equation by OLS, reporting the results in standard form. Interpret the
signs and magnitudes of the coefficients.

(iii) Which of the variables is statistically significant? 

(iv) Does your finding from part (iii) imply that the return on the S&P 500 is predictable? Explain. 


C10 Consider the model estimated in (10.15); use the data in INTDEF.

(i) Find the correlation between inf and def over this sample period and comment. 

(ii) Add a single lag of inf and def to the equation and report the results in the usual form. 

(iii) Compare the estimated LRP for the effect of inflation with that in equation (10.15). Are they 
vastly different? 

(iv) Are the two lags in the model jointly significant at the 5% level? 


C11 The file TRAFFIC2 contains 108 monthly observations on automobile accidents, traffic laws, and
some other variables for California from January 1981 through December 1989. Use this data set to 
answer the following questions. 

(i) During what month and year did California’s seat belt law take effect? When did the highway 
speed limit increase to 65 miles per hour? 

(ii) Regress the variable log(totacc) on a linear time trend and 11 monthly dummy variables, using 
January as the base month. Interpret the coefficient estimate on the time trend. Would you say 
there is seasonality in total accidents? 

(iii) Add to the regression from part (ii) the variables wkends, unem, spdlaw, and beltlaw. Discuss 
the coefficient on the unemployment variable. Does its sign and magnitude make sense to you? 

(iv) In the regression from part (iii), interpret the coefficients on spdlaw and beltlaw. Are the 
estimated effects what you expected? Explain. 

(v) The variable prcfat is the percentage of accidents resulting in at least one fatality. Note that this 
variable is a percentage, not a proportion. What is the average of prcfat over this period? Does 
the magnitude seem about right? 

(vi) Run the regression in part (iii) but use prcfat as the dependent variable in place of log(totacc). 
Discuss the estimated effects and significance of the speed and seat belt law variables. 


(i) Estimate equation (10.2) using all observations in PHILLIPS and report the results in the usual 
form. How many observations do you have now? 

(ii) Compare the estimates from part (i) with those in equation (10.14). In particular, does adding the
extra years help in obtaining an estimated tradeoff between inflation and unemployment? Explain.

(iii) Now run the regression using only the years 2007 through 2017. How do these estimates differ
from those in equation (10.14)? Are the estimates using only these most recent years precise
enough to draw any firm conclusions? Explain.

(iv) Consider a simple regression setup in which we start with $n$ time series observations and then
split them into an early time period and a later time period. In the first time period we have
$n_1$ observations and in the second period $n_2$ observations. Draw on the previous parts of this
exercise to evaluate the following statement: “Generally, we can expect the slope estimate using
all $n$ observations to be roughly equal to a weighted average of the slope estimates on the early
and later subsamples, where the weights are $n_1/n$ and $n_2/n$, respectively.”


Use the data in MINWAGE for this exercise. In particular, use the employment and wage series for
sector 232 (Men’s and Boys’ Furnishings). The variable gwage232 is the monthly growth (change in logs)
in the average wage in sector 232, gemp232 is the growth in employment in sector 232, gmwage is the
growth in the federal minimum wage, and gcpi is the growth in the (urban) Consumer Price Index.

(i) Run the regression gwage232 on gmwage, gcpi. Do the sign and magnitude of $\hat{\beta}_{gmwage}$ make
sense to you? Explain. Is gmwage statistically significant?

(ii) Add lags 1 through 12 of gmwage to the equation in part (i). Do you think it is necessary to
include these lags to estimate the long-run effect of minimum wage growth on wage growth in 
sector 232? Explain. 

(iii) Run the regression gemp232 on gmwage, gcpi. Does minimum wage growth appear to have a 
contemporaneous effect on gemp232? 

(iv) Add lags 1 through 12 to the employment growth equation. Does growth in the minimum wage 
have a statistically significant effect on employment growth, either in the short run or long run? 
Explain. 


Use the data in APPROVAL to answer the following questions. The data set consists of 78 months of 
data during the presidency of George W. Bush. (The data end in July 2007, before Bush left office.) 
In addition to economic variables and binary indicators of various events, it includes an approval 
rate, approve, collected by Gallup. (Caution: One should also attempt Computer Exercise C14 in 
Chapter 11 to gain a more complete understanding of the econometric issues involved in analyzing 
these data.) 

(i) What is the range of the variable approve? What is its average value?

(ii) Estimate the model 


$$approve_t = \beta_0 + \beta_1 lcpifood_t + \beta_2 lrgasprice_t + \beta_3 unemploy_t + u_t,$$


where the first two variables are in logarithmic form, and report the estimates in the usual way. 

(iii) Interpret the coefficients in the estimates from part (ii). Comment on the signs and sizes of the 
effects, as well as statistical significance. 

(iv) Add the binary variables sep11 and iraginvade to the equation from part (ii). Interpret the 
coefficients on the dummy variables. Are they statistically significant? 

(v) Does adding the dummy variables in part (iv) change the other estimates much? Are any of the 
coefficients in part (iv) hard to rationalize? 

(vi) Add lsp500 to the regression in part (iv). Controlling for other factors, does the stock market
have an important effect on the presidential approval rating?


In Chapter 10, we discussed the finite sample properties of OLS for time series data under increasingly
stronger sets of assumptions. Under the full set of classical linear model assumptions for time series,
TS.1 through TS.6, OLS has exactly the same desirable properties that we derived for cross-sectional
data. Likewise, statistical inference is carried out in the same way as it was for cross-sectional analysis.

From our cross-sectional analysis in Chapter 5, we know that there are good reasons for studying 
the large sample properties of OLS. For example, if the error terms are not drawn from a normal dis- 
tribution, then we must rely on the central limit theorem (CLT) to justify the usual OLS test statistics 
and confidence intervals. 

Large sample analysis is even more important in time series contexts. (This is somewhat ironic 
given that large time series samples can be difficult to come by; but we often have no choice other 
than to rely on large sample approximations.) In Section 10-3, we explained how the strict exogene- 
ity assumption (TS.3) might be violated in static and distributed lag models. As we will show in 
Section 11-2, models with lagged dependent variables must violate Assumption TS.3. 

Unfortunately, large sample analysis for time series problems is fraught with many more difficul- 
ties than it was for cross-sectional analysis. In Chapter 5, we obtained the large sample properties of 
OLS in the context of random sampling. Things are more complicated when we allow the observa- 
tions to be correlated across time. Nevertheless, the major limit theorems hold for certain, although 
not all, time series processes. The key is whether the correlation between the variables at different 
time periods tends to zero quickly enough. Time series that have substantial temporal correlation 
require special attention in regression analysis. This chapter will alert you to certain issues pertaining
to such series in regression analysis.


11-1 Stationary and Weakly Dependent Time Series


In this section, we present the key concepts that are needed to apply the usual large sample approxi- 
mations in regression analysis with time series data. The details are not as important as a general 
understanding of the issues. 


11-1a Stationary and Nonstationary Time Series 


Historically, the notion of a stationary process has played an important role in the analysis of time 
series. A stationary time series process is one in which the probability distributions are stable over 
time in the following sense: If we take any collection of random variables in the sequence and then 
shift that sequence ahead h time periods, the joint probability distribution must remain unchanged. 
A formal definition of stationarity follows. 


Stationary Stochastic Process. The stochastic process $\{x_t: t = 1, 2, \ldots\}$ is stationary if for every
collection of time indices $1 \le t_1 < t_2 < \cdots < t_m$, the joint distribution of $(x_{t_1}, x_{t_2}, \ldots, x_{t_m})$ is the same
as the joint distribution of $(x_{t_1+h}, x_{t_2+h}, \ldots, x_{t_m+h})$ for all integers $h \ge 1$.

This definition is a little abstract, but its meaning is pretty straightforward. One implication (by
choosing $m = 1$ and $t_1 = 1$) is that $x_t$ has the same distribution as $x_1$ for all $t = 2, 3, \ldots$. In other
words, the sequence $\{x_t: t = 1, 2, \ldots\}$ is identically distributed. Stationarity requires even more. For
example, the joint distribution of $(x_1, x_2)$ (the first two terms in the sequence) must be the same as the
joint distribution of $(x_t, x_{t+1})$ for any $t \ge 1$. Again, this places no restrictions on how $x_t$ and $x_{t+1}$ are
related to one another; indeed, they may be highly correlated. Stationarity does require that the nature
of any correlation between adjacent terms is the same across all time periods.

A stochastic process that is not stationary is said to be a nonstationary process. Because sta- 
tionarity is an aspect of the underlying stochastic process and not of the available single realization, it 
can be very difficult to determine whether the data we have collected were generated by a stationary 
process. However, it is easy to spot certain sequences that are not stationary. A process with a time 
trend of the type covered in Section 10-5 is clearly nonstationary: at a minimum, its mean changes 
over time. 

Sometimes, a weaker form of stationarity suffices. If $\{x_t: t = 1, 2, \ldots\}$ has a finite second
moment, that is, $E(x_t^2) < \infty$ for all $t$, then the following definition applies.


Covariance Stationary Process. A stochastic process $\{x_t: t = 1, 2, \ldots\}$ with a finite second moment
$[E(x_t^2) < \infty]$ is covariance stationary if (i) $E(x_t)$ is constant; (ii) $\mathrm{Var}(x_t)$ is constant; and (iii) for any
$t, h \ge 1$, $\mathrm{Cov}(x_t, x_{t+h})$ depends only on $h$ and not on $t$.

Covariance stationarity focuses only on the first two moments of a stochastic process: the mean and
variance of the process are constant across time, and the covariance between $x_t$ and $x_{t+h}$ depends only
on the distance between the two terms, $h$, and not on the location of the initial time period, $t$. It follows
immediately that the correlation between $x_t$ and $x_{t+h}$ also depends only on $h$.

Suppose that $\{y_t: t = 1, 2, \ldots\}$ is generated by $y_t = \delta_0 + \delta_1 t + e_t$, where $\delta_1 \ne 0$, and
$\{e_t: t = 1, 2, \ldots\}$ is an i.i.d. sequence with mean zero and variance $\sigma_e^2$. (i) Is $\{y_t\}$ covariance
stationary? (ii) Is $y_t - E(y_t)$ covariance stationary?

If a stationary process has a finite second moment, then it must be covariance stationary, but
the converse is certainly not true. Sometimes, to emphasize that stationarity is a stronger requirement
than covariance stationarity, the former is referred to as strict stationarity. Because strict stationarity
simplifies the statements of some of our subsequent assumptions, “stationarity” for us will always
mean the strict form.

How is stationarity used in time series econometrics? On a technical level, stationarity simplifies
statements of the law of large numbers (LLN) and the CLT, although we will not worry about formal 
statements in this chapter. On a practical level, if we want to understand the relationship between two 
or more variables using regression analysis, we need to assume some sort of stability over time. If we 
allow the relationship between two variables (say, y, and x,) to change arbitrarily in each time period, 
then we cannot hope to learn much about how a change in one variable affects the other variable if we 
only have access to a single time series realization. 

In stating a multiple regression model for time series data, we are assuming a certain form of
stationarity in that the $\beta_j$ do not change over time. Further, Assumptions TS.4 and TS.5 imply that
the variance of the error process is constant over time and that the correlation between errors in two
adjacent periods is equal to zero, which is clearly constant over time.


11-1b Weakly Dependent Time Series 


Stationarity has to do with the joint distributions of a process as it moves through time. A very different
concept is that of weak dependence, which places restrictions on how strongly related the random
variables $x_t$ and $x_{t+h}$ can be as the time distance between them, $h$, increases. The notion of weak
dependence is most easily discussed for a stationary time series: loosely speaking, a stationary time
series process $\{x_t: t = 1, 2, \ldots\}$ is said to be weakly dependent if $x_t$ and $x_{t+h}$ are “almost independent”
as $h$ increases without bound. A similar statement holds true if the sequence is nonstationary, but
then we must assume that the concept of being almost independent does not depend on the starting
point, $t$.

The description of weak dependence given in the previous paragraph is necessarily vague. 
We cannot formally define weak dependence because there is no definition that covers all cases of 
interest. There are many specific forms of weak dependence that are formally defined, but these are 
well beyond the scope of this text. [See White (1984), Hamilton (1994), and Wooldridge (1994b) for
advanced treatments of these concepts.]

For our purposes, an intuitive notion of the meaning of weak dependence is sufficient. Covariance
stationary sequences can be characterized in terms of correlations: a covariance stationary time series
is weakly dependent if the correlation between $x_t$ and $x_{t+h}$ goes to zero “sufficiently quickly” as
$h \to \infty$. (Because of covariance stationarity, the correlation does not depend on the starting point, $t$.)
In other words, as the variables get farther apart in time, the correlation between them becomes
smaller and smaller. Covariance stationary sequences where $\mathrm{Corr}(x_t, x_{t+h}) \to 0$ as $h \to \infty$ are said to
be asymptotically uncorrelated. Intuitively, this is how we will usually characterize weak dependence.
Technically, we need to assume that the correlation converges to zero fast enough, but we will
gloss over this.

Why is weak dependence important for regression analysis? Essentially, it replaces the assump- 
tion of random sampling in implying that the LLN and the CLT hold. The most well-known CLT for 
time series data requires stationarity and some form of weak dependence: thus, stationary, weakly 
dependent time series are ideal for use in multiple regression analysis. In Section 11-2, we will argue 
that OLS can be justified quite generally by appealing to the LLN and the CLT. Time series that are 
not weakly dependent—examples of which we will see in Section 11-3—do not generally satisfy the 
CLT, which is why their use in multiple regression analysis can be tricky. 

The simplest example of a weakly dependent time series is an independent, identically distrib- 
uted sequence: a sequence that is independent is trivially weakly dependent. A more interesting exam- 
ple of a weakly dependent sequence is 


$$x_t = e_t + \alpha_1 e_{t-1}, \quad t = 1, 2, \ldots, \tag{11.1}$$


where $\{e_t: t = 0, 1, \ldots\}$ is an i.i.d. sequence with zero mean and variance $\sigma_e^2$. The process $\{x_t\}$
is called a moving average process of order one [MA(1)]: $x_t$ is a weighted average of $e_t$ and $e_{t-1}$;
in the next period, we drop $e_{t-1}$, and then $x_{t+1}$ depends on $e_{t+1}$ and $e_t$. Setting the coefficient of
$e_t$ to 1 in (11.1) is done without loss of generality. [In equation (11.1), we use $x_t$ and $e_t$ as generic labels
for time series processes. They need have nothing to do with the explanatory variables or errors in a time
series regression model, although both the explanatory variables and errors could be MA(1) processes.]

Why is an MA(1) process weakly dependent? Adjacent terms in the sequence are correlated:
because $x_{t+1} = e_{t+1} + \alpha_1 e_t$, $\mathrm{Cov}(x_t, x_{t+1}) = \alpha_1 \mathrm{Var}(e_t) = \alpha_1 \sigma_e^2$. Because $\mathrm{Var}(x_t) = (1 + \alpha_1^2)\sigma_e^2$,
$\mathrm{Corr}(x_t, x_{t+1}) = \alpha_1/(1 + \alpha_1^2)$. For example, if $\alpha_1 = .5$, then $\mathrm{Corr}(x_t, x_{t+1}) = .4$. [The maximum
positive correlation occurs when $\alpha_1 = 1$, in which case, $\mathrm{Corr}(x_t, x_{t+1}) = .5$.] However, once we look at
variables in the sequence that are two or more time periods apart, these variables are uncorrelated
because they are independent. For example, $x_{t+2} = e_{t+2} + \alpha_1 e_{t+1}$ is independent of $x_t$ because $\{e_t\}$ is
independent across $t$. Due to the identical distribution assumption on the $e_t$, $\{x_t\}$ in (11.1) is actually
stationary. Thus, an MA(1) is a stationary, weakly dependent sequence, and the LLN and the CLT can
be applied to $\{x_t\}$.
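
The MA(1) correlations derived above can be checked numerically. The following is a minimal sketch, not part of the text's own material: it assumes standard normal innovations (the text only requires an i.i.d. sequence), and the helper names `ma1_corr`, `simulate_ma1`, and `sample_autocorr` are hypothetical. It compares the formula for the first-order correlation with sample autocorrelations from one long simulated realization of (11.1).

```python
import random

def ma1_corr(alpha1):
    """Theoretical Corr(x_t, x_{t+1}) for x_t = e_t + alpha1 * e_{t-1}."""
    return alpha1 / (1.0 + alpha1 ** 2)

def simulate_ma1(alpha1, n, seed=0):
    """Simulate n observations of an MA(1) with N(0, 1) innovations (an assumption)."""
    rng = random.Random(seed)
    e = [rng.gauss(0.0, 1.0) for _ in range(n + 1)]      # e_0, e_1, ..., e_n
    return [e[t] + alpha1 * e[t - 1] for t in range(1, n + 1)]

def sample_autocorr(x, h):
    """Sample correlation between x_t and x_{t+h}, using the full-sample mean."""
    n = len(x)
    xbar = sum(x) / n
    num = sum((x[t] - xbar) * (x[t + h] - xbar) for t in range(n - h))
    den = sum((v - xbar) ** 2 for v in x)
    return num / den

x = simulate_ma1(0.5, 200_000)
print("theory, h = 1:", ma1_corr(0.5))
print("sample, h = 1:", sample_autocorr(x, 1))
print("sample, h = 2:", sample_autocorr(x, 2))   # terms two periods apart are independent
```

With a long enough realization, the first sample autocorrelation should sit close to the theoretical value, while the second should be close to zero, which is the sense in which an MA(1) is weakly dependent.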

A more popular example is the process 


$$y_t = \rho_1 y_{t-1} + e_t, \quad t = 1, 2, \ldots. \tag{11.2}$$


The starting point in the sequence is $y_0$ (at $t = 0$), and $\{e_t: t = 1, 2, \ldots\}$ is an i.i.d. sequence with zero
mean and variance $\sigma_e^2$. We also assume that the $e_t$ are independent of $y_0$ and that $E(y_0) = 0$. This is
called an autoregressive process of order one [AR(1)].

The crucial assumption for weak dependence of an AR(1) process is the stability condition
$|\rho_1| < 1$. Then, we say that $\{y_t\}$ is a stable AR(1) process.

To see that a stable AR(1) process is asymptotically uncorrelated, it is useful to assume that the
process is covariance stationary. (In fact, it can generally be shown that $\{y_t\}$ is strictly stationary, but
the proof is somewhat technical.) Then, we know that $E(y_t) = E(y_{t-1})$, and from (11.2) with $\rho_1 \ne 1$,
this can happen only if $E(y_t) = 0$. Taking the variance of (11.2) and using the fact that $e_t$ and $y_{t-1}$ are
independent (and therefore uncorrelated), $\mathrm{Var}(y_t) = \rho_1^2 \mathrm{Var}(y_{t-1}) + \mathrm{Var}(e_t)$, and so, under covariance
stationarity, we must have $\sigma_y^2 = \rho_1^2 \sigma_y^2 + \sigma_e^2$. Because $\rho_1^2 < 1$ by the stability condition, we can
easily solve for $\sigma_y^2$:


$$\sigma_y^2 = \sigma_e^2/(1 - \rho_1^2). \tag{11.3}$$


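Now, we can find the covariance between $y_t$ and $y_{t+h}$ for $h \ge 1$. Using repeated substitution,


$$\begin{aligned}
y_{t+h} &= \rho_1 y_{t+h-1} + e_{t+h} = \rho_1(\rho_1 y_{t+h-2} + e_{t+h-1}) + e_{t+h} \\
&= \rho_1^2 y_{t+h-2} + \rho_1 e_{t+h-1} + e_{t+h} = \cdots \\
&= \rho_1^h y_t + \rho_1^{h-1} e_{t+1} + \cdots + \rho_1 e_{t+h-1} + e_{t+h}.
\end{aligned}$$


Because $E(y_t) = 0$ for all $t$, we can multiply this last equation by $y_t$ and take expectations to obtain
$\mathrm{Cov}(y_t, y_{t+h})$. Using the fact that $e_{t+j}$ is uncorrelated with $y_t$ for all $j \ge 1$ gives


$$\begin{aligned}
\mathrm{Cov}(y_t, y_{t+h}) = E(y_t y_{t+h}) &= \rho_1^h E(y_t^2) + \rho_1^{h-1} E(y_t e_{t+1}) + \cdots + E(y_t e_{t+h}) \\
&= \rho_1^h E(y_t^2) = \rho_1^h \sigma_y^2.
\end{aligned}$$

Because $\sigma_y$ is the standard deviation of both $y_t$ and $y_{t+h}$, we can easily find the correlation between
$y_t$ and $y_{t+h}$ for any $h \ge 1$:


$$\mathrm{Corr}(y_t, y_{t+h}) = \mathrm{Cov}(y_t, y_{t+h})/(\sigma_y \sigma_y) = \rho_1^h. \tag{11.4}$$


In particular, $\mathrm{Corr}(y_t, y_{t+1}) = \rho_1$, so $\rho_1$ is the correlation coefficient between any two adjacent terms
in the sequence.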

Equation (11.4) is important because it shows that, although $y_t$ and $y_{t+h}$ are correlated for any
$h \ge 1$, this correlation gets very small for large $h$: because $|\rho_1| < 1$, $\rho_1^h \to 0$ as $h \to \infty$. Even when
$\rho_1$ is large (say, .9, which implies a very high, positive correlation between adjacent terms), the
correlation between $y_t$ and $y_{t+h}$ tends to zero fairly rapidly. For example, $\mathrm{Corr}(y_t, y_{t+5}) = .590$,
$\mathrm{Corr}(y_t, y_{t+10}) = .349$, and $\mathrm{Corr}(y_t, y_{t+20}) = .122$. If $t$ indexes year, this means that the correlation
between the outcome of two $y$ that are 20 years apart is about .122. When $\rho_1$ is smaller, the correlation
dies out much more quickly. (You might try $\rho_1 = .5$ to verify this.)
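
The decay of the AR(1) correlations is easy to tabulate directly from equations (11.3) and (11.4). The sketch below is only an illustration; the function names `ar1_corr` and `ar1_variance` are hypothetical, and the innovation variance of 1 in the example call is an arbitrary choice.

```python
def ar1_corr(rho1, h):
    """Corr(y_t, y_{t+h}) = rho1**h for a stable AR(1), as in equation (11.4)."""
    if abs(rho1) >= 1:
        raise ValueError("stability requires |rho1| < 1")
    return rho1 ** h

def ar1_variance(rho1, sigma_e2):
    """Stationary variance sigma_e^2 / (1 - rho1^2), as in equation (11.3)."""
    if abs(rho1) >= 1:
        raise ValueError("stability requires |rho1| < 1")
    return sigma_e2 / (1.0 - rho1 ** 2)

for rho1 in (0.9, 0.5):
    corrs = {h: round(ar1_corr(rho1, h), 3) for h in (1, 5, 10, 20)}
    print("rho1 =", rho1, "Var(y) =", round(ar1_variance(rho1, 1.0), 3), corrs)
```

Comparing the two rows of output shows how much faster the correlation dies out when the autoregressive coefficient is .5 rather than .9.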

This analysis heuristically demonstrates that a stable AR(1) process is weakly dependent. The 
AR(1) model is especially important in multiple regression analysis with time series data. We will 
cover additional applications in Chapter 12 and the use of it for forecasting in Chapter 18. 

There are many other types of weakly dependent time series, including hybrids of autoregressive 
and moving average processes. But the previous examples work well for our purposes. 

Before ending this section, we must emphasize one point that often causes confusion in time 
series econometrics. A trending series, though certainly nonstationary, can be weakly dependent. In 
fact, in the simple linear time trend model in Chapter 10 [see equation (10.24)], the series $\{y_t\}$ was
actually independent. A series that is stationary about its time trend, as well as weakly dependent, is 
often called a trend-stationary process. (Notice that the name is not completely descriptive because 
we assume weak dependence along with stationarity.) Such processes can be used in regression analy- 
sis just as in Chapter 10, provided appropriate time trends are included in the model. 


11-2 Asymptotic Properties of OLS 


In Chapter 10, we saw some cases in which the classical linear model assumptions are not satisfied 
for certain time series problems. In such cases, we must appeal to large sample properties of OLS, just 
as with cross-sectional analysis. In this section, we state the assumptions and main results that justify 
OLS more generally. The proofs of the theorems in this chapter are somewhat difficult and therefore 
omitted. See Wooldridge (1994b). 


Assumption TS.1’ Linearity and Weak Dependence 


We assume the model is exactly as in Assumption TS.1, but now we add the assumption that
$\{(\mathbf{x}_t, y_t): t = 1, 2, \ldots\}$ is stationary and weakly dependent. In particular, the LLN and the CLT can be
applied to sample averages.


The linear in parameters requirement again means that we can write the model as 


$$y_t = \beta_0 + \beta_1 x_{t1} + \cdots + \beta_k x_{tk} + u_t, \tag{11.5}$$


where the $\beta_j$ are the parameters to be estimated. Unlike in Chapter 10, the $x_{tj}$ can include lags of the
dependent variable. As usual, lags of explanatory variables are also allowed.

We have included stationarity in Assumption TS.1’ for convenience in stating and interpreting
assumptions. If we were carefully working through the asymptotic properties of OLS, as we do in
Advanced Treatment E, stationarity would also simplify those derivations. But stationarity is not at all
critical for OLS to have its standard asymptotic properties. (As mentioned in Section 11-1, by assuming
the $\beta_j$ are constant across time, we are already assuming some form of stability in the distributions over
time.) The important extra restriction in Assumption TS.1’ as compared with Assumption TS.1 is the
weak dependence assumption. In Section 11-1, we spent some effort discussing weak dependence for a
time series process because it is by no means an innocuous assumption. Technically, Assumption TS.1’
requires weak dependence on multiple time series ($y_t$ and elements of $\mathbf{x}_t$), and this entails putting
restrictions on the joint distribution across time. The details are not particularly important and are,
anyway, beyond the scope of this text; see Wooldridge (1994). It is more important to understand
the kinds of persistent time series processes that violate the weak dependence requirement, and we
will turn to that in the next section. There, we also discuss the use of persistent processes in multiple
regression models.

Naturally, we still rule out perfect collinearity.


Assumption TS.2’ No Perfect Collinearity 


Same as Assumption TS.2. 


Assumption TS.3’ Zero Conditional Mean 


The explanatory variables $\mathbf{x}_t = (x_{t1}, x_{t2}, \ldots, x_{tk})$ are contemporaneously exogenous as in equation
(10.10): $E(u_t|\mathbf{x}_t) = 0$.


This is the most natural assumption concerning the relationship between $u_t$ and the explanatory
variables. It is much weaker than Assumption TS.3 because it puts no restrictions on how $u_t$ is related
to the explanatory variables in other time periods. We will see examples that satisfy TS.3’ shortly.
By stationarity, if contemporaneous exogeneity holds for one time period, it holds for them all.
Relaxing stationarity would simply require us to assume the condition holds for all $t = 1, 2, \ldots$.

For certain purposes, it is useful to know that the following consistency result only requires $u_t$ to
have zero unconditional mean and to be uncorrelated with each $x_{tj}$:


$$E(u_t) = 0, \quad \mathrm{Cov}(x_{tj}, u_t) = 0, \quad j = 1, \ldots, k. \tag{11.6}$$


We will work mostly with the zero conditional mean assumption because it leads to the most straight- 
forward asymptotic analysis. 


THEOREM 11.1 CONSISTENCY OF OLS


Under TS.1’, TS.2’, and TS.3’, the OLS estimators are consistent: $\mathrm{plim}\, \hat{\beta}_j = \beta_j$, $j = 0, 1, \ldots, k$.


There are some key practical differences between Theorems 10.1 and 11.1. First, in Theorem 
11.1, we conclude that the OLS estimators are consistent, but not necessarily unbiased. Second, in 
Theorem 11.1, we have weakened the sense in which the explanatory variables must be exogenous, 
but weak dependence is required in the underlying time series. Weak dependence is also crucial in 
obtaining approximate distributional results, which we cover later. 


Static Model 


Consider a static model with two explanatory variables:


$$y_t = \beta_0 + \beta_1 z_{t1} + \beta_2 z_{t2} + u_t. \tag{11.7}$$

Under weak dependence, the condition sufficient for consistency of OLS is

$$E(u_t|z_{t1}, z_{t2}) = 0. \tag{11.8}$$


This rules out omitted variables that are in $u_t$ and are correlated with either $z_{t1}$ or $z_{t2}$. Also, no function
of $z_{t1}$ or $z_{t2}$ can be correlated with $u_t$, and so Assumption TS.3’ rules out misspecified functional form,
just as in the cross-sectional case. Other problems, such as measurement error in the variables $z_{t1}$ or
$z_{t2}$, can cause (11.8) to fail.

Importantly, Assumption TS.3’ does not rule out correlation between, say, $u_{t-1}$ and $z_{t1}$. This type
of correlation could arise if $z_{t1}$ is related to past $y_{t-1}$, such as


$$z_{t1} = \delta_0 + \delta_1 y_{t-1} + v_t. \tag{11.9}$$


For example, $z_{t1}$ might be a policy variable, such as monthly percentage change in the money supply,
and this change might depend on last month’s rate of inflation ($y_{t-1}$). Such a mechanism generally
causes $z_{t1}$ and $u_{t-1}$ to be correlated (as can be seen by plugging in for $y_{t-1}$). This kind of feedback is
allowed under Assumption TS.3’.
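
A small simulation can make the feedback point concrete. The sketch below is an illustration under simplifying assumptions that are not in the text: a single explanatory variable rather than two, independent standard normal errors, a starting value of zero, and arbitrary parameter values chosen so that the implied process for y is stable. The function names are hypothetical. The regressor reacts to last period's outcome as in (11.9), so it is correlated with the lagged error, yet the OLS slope is still close to the true value in a long sample.

```python
import random

def simulate_feedback(n, beta0=1.0, beta1=1.0, delta0=0.0, delta1=0.5, seed=0):
    """Simulate y_t = beta0 + beta1*z_t + u_t with z_t = delta0 + delta1*y_{t-1} + v_t."""
    rng = random.Random(seed)
    y_prev = 0.0                               # arbitrary starting value (an assumption)
    z, y, u = [], [], []
    for _t in range(n):
        v_t = rng.gauss(0.0, 1.0)
        u_t = rng.gauss(0.0, 1.0)              # independent of z_t, so E(u_t | z_t) = 0
        z_t = delta0 + delta1 * y_prev + v_t   # feedback from last period's y
        y_t = beta0 + beta1 * z_t + u_t
        z.append(z_t)
        y.append(y_t)
        u.append(u_t)
        y_prev = y_t
    return z, y, u

def ols_slope(x, y):
    """Slope from a simple OLS regression of y on x with an intercept."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    return sxy / sxx

def corr(a, b):
    """Sample correlation between two equal-length sequences."""
    n = len(a)
    abar, bbar = sum(a) / n, sum(b) / n
    sab = sum((p - abar) * (q - bbar) for p, q in zip(a, b))
    saa = sum((p - abar) ** 2 for p in a)
    sbb = sum((q - bbar) ** 2 for q in b)
    return sab / (saa * sbb) ** 0.5

z, y, u = simulate_feedback(100_000)
print("OLS slope (true value 1):", ols_slope(z, y))
print("Corr(z_t, u_t):", corr(z, u))                 # contemporaneous exogeneity
print("Corr(z_t, u_{t-1}):", corr(z[1:], u[:-1]))    # feedback: strict exogeneity fails
```

The design choice here is to store the simulated errors so that both correlations can be inspected directly, something that is of course impossible with real data.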


Finite Distributed Lag Model 


In the finite distributed lag model, 


$$y_t = \alpha_0 + \delta_0 z_t + \delta_1 z_{t-1} + \delta_2 z_{t-2} + u_t, \tag{11.10}$$

a very natural assumption is that the expected value of $u_t$, given current and all past values of $z$, is
zero:

$$E(u_t|z_t, z_{t-1}, z_{t-2}, z_{t-3}, \ldots) = 0. \tag{11.11}$$


This means that, once $z_t$, $z_{t-1}$, and $z_{t-2}$ are included, no further lags of $z$ affect $E(y_t|z_t, z_{t-1}, z_{t-2}, z_{t-3}, \ldots)$;
if this were not true, we would put further lags into the equation. For example, $y_t$ could be the annual
percentage change in investment and $z_t$ a measure of interest rates during year $t$. When we set
$\mathbf{x}_t = (z_t, z_{t-1}, z_{t-2})$, Assumption TS.3’ is then satisfied: OLS will be consistent. As in the previous
example, TS.3’ does not rule out feedback from $y$ to future values of $z$.


The previous two examples do not necessarily require asymptotic theory because the explanatory 
variables could be strictly exogenous. The next example clearly violates the strict exogeneity assump- 
tion; therefore, we can only appeal to large sample properties of OLS. 


AR(1) Model 
Consider the AR(1) model,


$$y_t = \beta_0 + \beta_1 y_{t-1} + u_t, \tag{11.12}$$

where the error $u_t$ has a zero expected value, given all past values of $y$:

$$E(u_t|y_{t-1}, y_{t-2}, \ldots) = 0. \tag{11.13}$$

Combined, these two equations imply that

$$E(y_t|y_{t-1}, y_{t-2}, \ldots) = E(y_t|y_{t-1}) = \beta_0 + \beta_1 y_{t-1}. \tag{11.14}$$


This result is very important. First, it means that, once $y$ lagged one period has been controlled for,
no further lags of $y$ affect the expected value of $y_t$. (This is where the name “first order” originates.)
Second, the relationship is assumed to be linear.

Because $\mathbf{x}_t$ contains only $y_{t-1}$, equation (11.13) implies that Assumption TS.3’ holds. By contrast,
the strict exogeneity assumption needed for unbiasedness, Assumption TS.3, does not hold. Because
the set of explanatory variables for all time periods includes all of the values on $y$ except the last,
$(y_0, y_1, \ldots, y_{n-1})$, Assumption TS.3 requires that, for all $t$, $u_t$ is uncorrelated with each of $y_0, y_1, \ldots, y_{n-1}$.
This cannot be true. Indeed, because $u_t$ is uncorrelated with $y_{t-1}$ under (11.13), $u_t$ and $y_t$ must be
correlated: it is easily seen that $\mathrm{Cov}(y_t, u_t) = \mathrm{Var}(u_t) > 0$. Therefore, a model with a lagged
dependent variable cannot satisfy the strict exogeneity Assumption TS.3.

For the weak dependence condition to hold, we must assume that $|\beta_1| < 1$, as we discussed in
Section 11-1. If this condition holds, then Theorem 11.1 implies that the OLS estimator from the
regression of $y_t$ on $y_{t-1}$ produces consistent estimators of $\beta_0$ and $\beta_1$. Unfortunately, $\hat{\beta}_1$ is biased, and
this bias can be large if the sample size is small or if $\beta_1$ is near 1. (For $\beta_1$ near 1, $\hat{\beta}_1$ can have a severe
downward bias.) In moderate to large samples, $\hat{\beta}_1$ should be a good estimator of $\beta_1$.
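
The contrast between small-sample bias and large-sample consistency in the AR(1) regression can be seen in a short Monte Carlo experiment. This is a sketch under assumptions not stated in the text: a zero intercept, standard normal errors, an initial value drawn from the stationary distribution, and arbitrary choices of sample sizes and replication counts. The function names are hypothetical.

```python
import random

def ar1_ols_slope(y):
    """OLS slope from regressing y_t on y_{t-1} with an intercept."""
    lagged, current = y[:-1], y[1:]
    n = len(lagged)
    lbar, cbar = sum(lagged) / n, sum(current) / n
    sxy = sum((a - lbar) * (b - cbar) for a, b in zip(lagged, current))
    sxx = sum((a - lbar) ** 2 for a in lagged)
    return sxy / sxx

def mean_ols_slope(beta1, n, reps, seed=0):
    """Average OLS slope over `reps` simulated stable AR(1) samples of n usable observations."""
    if abs(beta1) >= 1:
        raise ValueError("stability requires |beta1| < 1")
    rng = random.Random(seed)
    sd_y = (1.0 / (1.0 - beta1 ** 2)) ** 0.5     # stationary s.d. when Var(u_t) = 1
    total = 0.0
    for _rep in range(reps):
        y = [rng.gauss(0.0, sd_y)]               # y_0 drawn from the stationary distribution
        for _t in range(n):
            y.append(beta1 * y[-1] + rng.gauss(0.0, 1.0))   # beta0 = 0 (an assumption)
        total += ar1_ols_slope(y)
    return total / reps

small = mean_ols_slope(0.9, n=25, reps=2000)
large = mean_ols_slope(0.9, n=400, reps=300)
print("average slope estimate, n = 25: ", small)
print("average slope estimate, n = 400:", large)
```

In runs of this kind, the average estimate with 25 observations falls noticeably below the true value of .9, while the average with 400 observations is much closer to it.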


When using the standard inference procedures, we need to impose versions of the homoskedas- 
ticity and no serial correlation assumptions. These are less restrictive than their classical linear model 
counterparts from Chapter 10. 


Assumption TS.4’ Homoskedasticity 


The errors are contemporaneously homoskedastic, that is, $\mathrm{Var}(u_t|\mathbf{x}_t) = \sigma^2$.


Assumption TS.5’ No Serial Correlation


For all $t \ne s$, $E(u_t u_s|\mathbf{x}_t, \mathbf{x}_s) = 0$.


In TS.4’, note how we condition only on the explanatory variables at time $t$ (compare to TS.4).
In TS.5’, we condition only on the explanatory variables in the time periods coinciding with $u_t$ and $u_s$.
As stated, this assumption is a little difficult to interpret, but it is the right condition for studying the
large sample properties of OLS in a variety of time series regressions. When considering TS.5’, we
often ignore the conditioning on $\mathbf{x}_t$ and $\mathbf{x}_s$, and we think about whether $u_t$ and $u_s$ are uncorrelated, for
all $t \ne s$.

Serial correlation is often a problem in static and finite distributed lag regression models: nothing
guarantees that the unobservables $u_t$ are uncorrelated over time. Importantly, Assumption TS.5’ does
hold in the AR(1) model stated in equations (11.12) and (11.13). Because the explanatory variable at time
$t$ is $y_{t-1}$, we must show that $E(u_t u_s|y_{t-1}, y_{s-1}) = 0$ for all $t \ne s$. To see this, suppose that $s < t$. (The other
case follows by symmetry.) Then, because $u_s = y_s - \beta_0 - \beta_1 y_{s-1}$, $u_s$ is a function of $y$ dated before
time $t$. But by (11.13), $E(u_t|u_s, y_{t-1}, y_{s-1}) = 0$, and so $E(u_t u_s|u_s, y_{t-1}, y_{s-1}) = u_s E(u_t|u_s, y_{t-1}, y_{s-1}) = 0$.
By the law of iterated expectations (see Math Refresher B), $E(u_t u_s|y_{t-1}, y_{s-1}) = 0$. This is very
important: as long as only one lag belongs in (11.12), the errors must be serially uncorrelated. We will
discuss this feature of dynamic models more generally in Section 11-4.

We now obtain an asymptotic result that is practically identical to the cross-sectional case. 


THEOREM 11.2 ASYMPTOTIC NORMALITY OF OLS


Under TS.1’ through TS.5’, the OLS estimators are asymptotically normally distributed. Further, the
usual OLS standard errors, $t$ statistics, $F$ statistics, and $LM$ statistics are asymptotically valid.


This theorem provides additional justification for at least some of the examples estimated in Chapter 10: 
even if the classical linear model assumptions do not hold, OLS is still consistent, and the usual 
inference procedures are valid. Of course, this hinges on TS.1’ through TS.5’ being true. In the next 
section, we discuss ways in which the weak dependence assumption can fail. The problems of serial 
correlation and heteroskedasticity are treated in Chapter 12.


Efficient Markets Hypothesis


We can use asymptotic analysis to test a version of the efficient markets hypothesis (EMH). Let $y_t$ be
the weekly percentage return (from Wednesday close to Wednesday close) on the New York Stock
Exchange composite index. A strict form of the EMH states that information observable to the market
prior to week $t$ should not help to predict the return during week $t$. If we use only past information on
$y$, the EMH is stated as


$$E(y_t|y_{t-1}, y_{t-2}, \ldots) = E(y_t). \tag{11.15}$$


If (11.15) is false, then we could use information on past weekly returns to predict the current return. 
The EMH presumes that such investment opportunities will be noticed and will disappear almost 
instantaneously. 

One simple way to test (11.15) is to specify the AR(1) model in (11.12) as the alternative model.
Then, the null hypothesis is easily stated as $H_0: \beta_1 = 0$. Under the null hypothesis, Assumption TS.3’
is true by (11.15), and, as we discussed earlier, serial correlation is not an issue. The homoskedasticity
assumption is $\mathrm{Var}(y_t|y_{t-1}) = \mathrm{Var}(y_t) = \sigma^2$, which we just assume is true for now. Under the null
hypothesis, stock returns are serially uncorrelated, so we can safely assume that they are weakly
dependent. Then, Theorem 11.2 says we can use the usual OLS $t$ statistic for $\hat{\beta}_1$ to test $H_0: \beta_1 = 0$
against $H_1: \beta_1 \ne 0$.

The weekly returns in NYSE are computed using data from January 1976 through March 1989. 
In the rare case that Wednesday was a holiday, the close at the next trading day was used. The average 
weekly return over this period was .196 in percentage form, with the largest weekly return being 
8.45% and the smallest being −15.32% (during the stock market crash of October 1987). Estimation 
of the AR(1) model gives 


$$\begin{aligned}
\widehat{\mathit{return}}_t &= \underset{(.081)}{.180} + \underset{(.038)}{.059}\,\mathit{return}_{t-1} \\
n &= 689,\quad R^2 = .0035,\quad \bar{R}^2 = .0020.
\end{aligned}$$ [11.16] 

The $t$ statistic for the coefficient on $\mathit{return}_{t-1}$ is about 1.55, and so $H_0\colon \beta_1 = 0$ cannot be rejected 
against the two-sided alternative, even at the 10% significance level. The estimate does suggest a 
slight positive correlation in the NYSE return from one week to the next, but it is not strong enough to 
warrant rejection of the EMH. 
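
The mechanics of this test can be sketched in a few lines of code. The sketch below is only an illustration under stated assumptions: the return series is simulated as i.i.d. draws (standing in for the NYSE data, which are not reproduced here), the helper name `ar1_ols` is hypothetical, and the standard error is the usual homoskedasticity-based one. The last computation simply recovers the $t$ statistic implied by the reported estimates in (11.16).

```python
import math
import random


def ar1_ols(y):
    """OLS of y_t on a constant and y_{t-1}; returns (b0, b1, se_b1)."""
    lagged, current = y[:-1], y[1:]
    n = len(current)
    mean_lag = sum(lagged) / n
    mean_cur = sum(current) / n
    sxx = sum((x - mean_lag) ** 2 for x in lagged)
    sxy = sum((x - mean_lag) * (c - mean_cur) for x, c in zip(lagged, current))
    b1 = sxy / sxx
    b0 = mean_cur - b1 * mean_lag
    ssr = sum((c - b0 - b1 * x) ** 2 for x, c in zip(lagged, current))
    se_b1 = math.sqrt(ssr / (n - 2) / sxx)  # usual (homoskedastic) standard error
    return b0, b1, se_b1


rng = random.Random(0)  # fixed seed so the sketch is reproducible
# Simulated weekly returns with no serial correlation (the EMH null), 690 weeks,
# which leaves n = 689 usable observations after taking one lag.
returns = [0.2 + 2.0 * rng.gauss(0.0, 1.0) for _ in range(690)]

b0, b1, se_b1 = ar1_ols(returns)
t_simulated = b1 / se_b1      # compare with a two-sided critical value such as 1.645
t_reported = 0.059 / 0.038    # t statistic implied by the estimates in (11.16)
print(f"simulated: b1 = {b1:.3f}, t = {t_simulated:.2f}; reported t = {t_reported:.2f}")
```

Because the simulated series satisfies the null hypothesis by construction, the simulated $t$ statistic should usually be small in absolute value, just as in the NYSE application.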


In the previous example, using an AR(1) model to test the EMH might not detect correlation 
between weekly returns that are more than one week apart. It is easy to estimate models with more 
than one lag. For example, an autoregressive model of order two, or AR(2) model, is 


$$y_t = \beta_0 + \beta_1 y_{t-1} + \beta_2 y_{t-2} + u_t$$ [11.17] 
$$E(u_t \mid y_{t-1}, y_{t-2}, \ldots) = 0.$$ 

There are stability conditions on $\beta_1$ and $\beta_2$ that are needed to ensure that the AR(2) process is weakly 
dependent, but this is not an issue here because the null hypothesis states that the EMH holds: 

$$H_0\colon \beta_1 = \beta_2 = 0.$$ [11.18] 

If we add the homoskedasticity assumption $\mathrm{Var}(u_t \mid y_{t-1}, y_{t-2}) = \sigma^2$, we can use a standard $F$ 
statistic to test (11.18). If we estimate an AR(2) model for $\mathit{return}_t$, we obtain 

$$\begin{aligned}
\widehat{\mathit{return}}_t &= \underset{(.081)}{.186} + \underset{(.038)}{.060}\,\mathit{return}_{t-1} - \underset{(.038)}{.038}\,\mathit{return}_{t-2} \\
n &= 688,\quad R^2 = .0048,\quad \bar{R}^2 = .0019
\end{aligned}$$ 

(where we lose one more observation because of the additional lag in the equation). The two lags are 
individually insignificant at the 10% level. They are also jointly insignificant: using $R^2 = .0048$, we 
find the $F$ statistic is approximately $F = 1.65$; the $p$-value for this $F$ statistic (with 2 and 685 degrees 
of freedom) is about .193. Thus, we do not reject (11.18) at even the 15% significance level. 
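
The $F$ statistic quoted above can be recovered from the reported $R^2$ alone. The following is a minimal sketch of that arithmetic using the $R$-squared form of the $F$ statistic for the joint significance of all slope coefficients; the function name `overall_f_from_r2` is hypothetical and is introduced only for illustration.

```python
def overall_f_from_r2(r2, n, k):
    """R-squared form of the F statistic for H0: all k slope coefficients are zero."""
    df_denominator = n - k - 1
    f_stat = (r2 / k) / ((1.0 - r2) / df_denominator)
    return f_stat, df_denominator


# Values reported for the AR(2) model: R^2 = .0048, n = 688, two lags.
f_stat, df_denominator = overall_f_from_r2(r2=0.0048, n=688, k=2)
print(f"F = {f_stat:.2f} with 2 and {df_denominator} degrees of freedom")
```

The $p$-value of about .193 would additionally require the $F_{2,685}$ distribution function, which is not computed in this sketch.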


EXAMPLE 11.5 Expectations Augmented Phillips Curve 


A linear version of the expectations augmented Phillips curve can be written as 


$$\mathit{inf}_t - \mathit{inf}_t^{\,e} = \beta_1(\mathit{unem}_t - \mu_0) + e_t,$$ 


where $\mu_0$ is the natural rate of unemployment and $\mathit{inf}_t^{\,e}$ is the expected rate of inflation formed in year 
$t - 1$. This model assumes that the natural rate is constant, something that macroeconomists question. 
The difference between actual unemployment and the natural rate is called cyclical unemployment, 
while the difference between actual and expected inflation is called unanticipated inflation. The 
error term, $e_t$, is called a supply shock by macroeconomists. If there is a tradeoff between unanticipated 
inflation and cyclical unemployment, then $\beta_1 < 0$. [For a detailed discussion of the expectations 
augmented Phillips curve, see Mankiw (1994, Section 11-2).] 

To complete this model, we need to make an assumption about inflationary expectations. Under 
adaptive expectations, the expected value of current inflation depends on recently observed inflation. 
A particularly simple formulation is that expected inflation this year is last year’s inflation: $\mathit{inf}_t^{\,e} = \mathit{inf}_{t-1}$. 
(See Section 18-1 for an alternative formulation of adaptive expectations.) Under this assumption, we 
can write 


$$\mathit{inf}_t - \mathit{inf}_{t-1} = \beta_0 + \beta_1 \mathit{unem}_t + e_t$$ 
or 
$$\Delta \mathit{inf}_t = \beta_0 + \beta_1 \mathit{unem}_t + e_t,$$ 


where $\Delta \mathit{inf}_t = \mathit{inf}_t - \mathit{inf}_{t-1}$ and $\beta_0 = -\beta_1 \mu_0$. ($\beta_0$ is expected to be positive, as $\beta_1 < 0$ and $\mu_0 > 0$.) 
Therefore, under adaptive expectations, the expectations augmented Phillips curve relates the change 
in inflation to the level of unemployment and a supply shock, $e_t$. If $e_t$ is uncorrelated with $\mathit{unem}_t$, as is 
typically assumed, then we can consistently estimate $\beta_0$ and $\beta_1$ by OLS. (We do not have to assume 
that, say, future unemployment rates are unaffected by the current supply shock.) We assume that 
TS.1’ through TS.5’ hold. Using the data through 2006 in PHILLIPS we estimate 


$$\begin{aligned}
\widehat{\Delta \mathit{inf}}_t &= \underset{(1.18)}{2.82} - \underset{(.202)}{.515}\,\mathit{unem}_t \\
n &= 58,\quad R^2 = 0.104,\quad \bar{R}^2 = 0.089.
\end{aligned}$$ [11.19] 


The tradeoff between cyclical unemployment and unanticipated inflation is pronounced in equation 
(11.19): a one-point increase in $\mathit{unem}$ lowers unanticipated inflation by over one-half of a point. 
The effect is statistically significant (two-sided $p$-value $\approx .014$). We can contrast this with the static 
Phillips curve in Example 10.1, where we found a slightly positive relationship between inflation and 
unemployment. 

Because we can write the natural rate as $\mu_0 = \beta_0/(-\beta_1)$, we can use (11.19) to obtain our own 
estimate of the natural rate: $\hat{\mu}_0 = \hat{\beta}_0/(-\hat{\beta}_1) = 2.82/.515 \approx 5.48$. Thus, we estimate the natural rate 
to be about 5.5, which is well within the range suggested by macroeconomists: historically, 5% to 
6% is a common range cited for the natural rate of unemployment. A standard error of this estimate 
is difficult to obtain because we have a nonlinear function of the OLS estimators. Wooldridge (2010, 
Chapter 3) contains the theory for general nonlinear functions. In the current application, the standard 
error is .577, which leads to an asymptotic 95% confidence interval (based on the standard normal 
distribution) of about 4.35 to 6.61 for the natural rate. 
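
The point estimate and confidence interval for the natural rate follow from simple arithmetic on the reported numbers. The sketch below takes the coefficient estimates from (11.19) and the standard error .577 as given (the standard error itself is not derived here, because that would require the estimated covariance between the two OLS coefficients, which is not reported). Small differences in the last digit relative to the text can arise because the text rounds the point estimate to 5.48 before forming the interval.

```python
beta0_hat = 2.82     # intercept estimate from (11.19)
beta1_hat = -0.515   # coefficient on unem from (11.19)
se_mu0 = 0.577       # standard error of the natural rate estimate, as reported

mu0_hat = beta0_hat / (-beta1_hat)   # natural rate: beta0 / (-beta1)
z_crit = 1.96                        # standard normal critical value, 95% interval
ci_lower = mu0_hat - z_crit * se_mu0
ci_upper = mu0_hat + z_crit * se_mu0
print(f"natural rate = {mu0_hat:.2f}, 95% CI = [{ci_lower:.2f}, {ci_upper:.2f}]")
```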
GOING FURTHER 11.2 

Suppose that expectations are formed as $\mathit{inf}_t^{\,e} = (1/2)\mathit{inf}_{t-1} + (1/2)\mathit{inf}_{t-2}$. What 
regression would you run to estimate the expectations augmented Phillips curve? 


Under Assumptions TS.1’ through TS.5’, we can show that the OLS estimators are asymptotically 
efficient in the class of estimators described in Theorem 5.3, but we replace the cross-sectional 
observation index $i$ with the time series index $t$. Finally, models with trending explanatory variables 
can effectively satisfy Assumptions TS.1’ through TS.5’, provided they are trend stationary. As long as 
time trends are included in the equations when needed, the usual inference procedures are 
asymptotically valid. 


11-3 Using Highly Persistent Time Series in Regression Analysis 


The previous section shows that, provided the time series we use are weakly dependent, usual OLS 
inference procedures are valid under assumptions weaker than the classical linear model assumptions. 
Unfortunately, many economic time series cannot be characterized by weak dependence. Using time 
series with strong dependence in regression analysis poses no problem, if the CLM assumptions in 
Chapter 10 hold. But the usual inference procedures are very susceptible to violation of these assump- 
tions when the data are not weakly dependent, because then we cannot appeal to the LLN and the 
CLT. In this section, we provide some examples of highly persistent (or strongly dependent) time 
series and show how they can be transformed for use in regression analysis. 


11-3a Highly Persistent Time Series 


In the simple AR(1) model (11.2), the assumption $|\rho_1| < 1$ is crucial for the series to be weakly 
dependent. It turns out that many economic time series are better characterized by the AR(1) model 
with $\rho_1 = 1$. In this case, we can write 


$$y_t = y_{t-1} + e_t, \quad t = 1, 2, \ldots,$$ [11.20] 


where we again assume that $\{e_t\colon t = 1, 2, \ldots\}$ is independent and identically distributed with mean 
zero and variance $\sigma_e^2$. We assume that the initial value, $y_0$, is independent of $e_t$ for all $t \geq 1$. 

The process in (11.20) is called a random walk. The name comes from the fact that $y$ at time $t$ is 
obtained by starting at the previous value, $y_{t-1}$, and adding a zero mean random variable that is 
independent of $y_{t-1}$. Sometimes, a random walk is defined differently by assuming different properties of 
the innovations, $e_t$ (such as lack of correlation rather than independence), but the current definition 
suffices for our purposes. 

First, we find the expected value of $y_t$. This is most easily done by using repeated substitution to get 


$$y_t = e_t + e_{t-1} + \cdots + e_1 + y_0.$$ 

Taking the expected value of both sides gives 

$$\begin{aligned}
E(y_t) &= E(e_t) + E(e_{t-1}) + \cdots + E(e_1) + E(y_0) \\
&= E(y_0), \quad \text{for all } t \geq 1.
\end{aligned}$$ 


Therefore, the expected value of a random walk does not depend on $t$. A popular assumption is that 
$y_0 = 0$ (the process begins at zero at time zero), in which case $E(y_t) = 0$ for all $t$. 

By contrast, the variance of a random walk does change with $t$. To compute the variance of a 
random walk, for simplicity we assume that $y_0$ is nonrandom so that $\mathrm{Var}(y_0) = 0$; this does not affect 
any important conclusions. Then, by the i.i.d. assumption for $\{e_t\}$, 


$$\mathrm{Var}(y_t) = \mathrm{Var}(e_t) + \mathrm{Var}(e_{t-1}) + \cdots + \mathrm{Var}(e_1) = \sigma_e^2 t.$$ [11.21] 


In other words, the variance of a random walk increases as a linear function of time. This shows that 
the process cannot be stationary. 
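
A small simulation can make (11.21) concrete. The sketch below generates many independent random walks with $y_0 = 0$ and standard normal innovations (so $\sigma_e^2 = 1$, an assumption chosen for convenience) and compares the variance of $y_t$ across replications with $\sigma_e^2 t$. The function names and the number of replications are hypothetical choices made only for this illustration.

```python
import random


def simulate_random_walk(n, rng, sigma=1.0, y0=0.0):
    """Return [y_1, ..., y_n] for y_t = y_{t-1} + e_t with e_t ~ Normal(0, sigma^2)."""
    y, path = y0, []
    for _ in range(n):
        y += rng.gauss(0.0, sigma)
        path.append(y)
    return path


def variance_across_paths(paths, t):
    """Sample variance of y_t (t is 1-based) across the simulated paths."""
    values = [path[t - 1] for path in paths]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


rng = random.Random(42)  # fixed seed for reproducibility
paths = [simulate_random_walk(50, rng) for _ in range(4000)]

var_10 = variance_across_paths(paths, 10)   # theory: sigma_e^2 * 10 = 10
var_50 = variance_across_paths(paths, 50)   # theory: sigma_e^2 * 50 = 50
print(f"Var(y_10) ~ {var_10:.1f}, Var(y_50) ~ {var_50:.1f}")
```

With enough replications, the two simulated variances should be close to 10 and 50, respectively, which illustrates the linear growth of the variance in $t$.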

Even more importantly, a random walk displays highly persistent behavior in the sense that the 
value of y today is important for determining the value of y in the very distant future. To see this, write 
for h periods hence, 


$$y_{t+h} = e_{t+h} + e_{t+h-1} + \cdots + e_{t+1} + y_t.$$ 


Now, suppose at time $t$, we want to compute the expected value of $y_{t+h}$ given the current value $y_t$. 
Because the expected value of $e_{t+j}$, given $y_t$, is zero for all $j \geq 1$, we have 


$$E(y_{t+h} \mid y_t) = y_t, \quad \text{for all } h \geq 1.$$ [11.22] 


This means that, no matter how far in the future we look, our best prediction of $y_{t+h}$ is today’s value, $y_t$. 
We can contrast this with the stable AR(1) case, where a similar argument can be used to show that 


$$E(y_{t+h} \mid y_t) = \rho_1^h y_t, \quad \text{for all } h \geq 1.$$ 


Under stability, $|\rho_1| < 1$, and so $E(y_{t+h} \mid y_t)$ approaches zero as $h \to \infty$: the value of $y_t$ becomes less and 
less important, and $E(y_{t+h} \mid y_t)$ gets closer and closer to the unconditional expected value, $E(y_t) = 0$. 
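
The contrast between the two forecasting rules is easy to tabulate. The following sketch evaluates $E(y_{t+h} \mid y_t) = \rho_1^h y_t$ for a random walk ($\rho_1 = 1$) and for a stable AR(1); the value $\rho_1 = .9$, the current value $y_t = 10$, and the horizons are hypothetical numbers chosen only for illustration.

```python
def ar1_forecast(y_t, rho1, h):
    """E(y_{t+h} | y_t) = rho1**h * y_t for a zero-mean AR(1) process."""
    return (rho1 ** h) * y_t


y_today = 10.0
horizons = [1, 5, 20, 50]
random_walk_forecasts = [ar1_forecast(y_today, 1.0, h) for h in horizons]
stable_forecasts = [ar1_forecast(y_today, 0.9, h) for h in horizons]

for h, rw, st in zip(horizons, random_walk_forecasts, stable_forecasts):
    print(f"h = {h:2d}: random walk forecast = {rw:.2f}, stable AR(1) forecast = {st:.2f}")
```

The random walk forecast stays at today's value at every horizon, while the stable AR(1) forecast decays toward the unconditional mean of zero.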

When $h = 1$, equation (11.22) is reminiscent of the adaptive expectations assumption we used 
for the inflation rate in Example 11.5: if inflation follows a random walk, then the expected value of 
$\mathit{inf}_t$, given past values of inflation, is simply $\mathit{inf}_{t-1}$. Thus, a random walk model for inflation justifies 
the use of adaptive expectations. 

We can also see that the correlation between $y_t$ and $y_{t+h}$ is close to one for large $t$ when $\{y_t\}$ follows 
a random walk. If $\mathrm{Var}(y_0) = 0$, it can be shown that 


$$\mathrm{Corr}(y_t, y_{t+h}) = \sqrt{t/(t + h)}.$$ 


Thus, the correlation depends on the starting point, $t$ (so that $\{y_t\}$ is not covariance stationary). 
Further, although for fixed $t$ the correlation tends to zero as $h \to \infty$, it does not do so very quickly. 
In fact, the larger $t$ is, the more slowly the correlation tends to zero as $h$ increases. If we choose $h$ to 
be something large (say, $h = 100$), we can always choose a large enough $t$ such that the correlation 
between $y_t$ and $y_{t+h}$ is arbitrarily close to one. (If $h = 100$ and we want the correlation to be greater 
than .95, then $t > 1{,}000$ does the trick.) Therefore, a random walk does not satisfy the requirement of 
an asymptotically uncorrelated sequence. 
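
The parenthetical claim about $h = 100$ can be checked directly from the correlation formula. The sketch below evaluates $\sqrt{t/(t+h)}$ and searches for the smallest starting point $t$ at which the correlation exceeds .95; the function name `random_walk_corr` is hypothetical.

```python
import math


def random_walk_corr(t, h):
    """Corr(y_t, y_{t+h}) = sqrt(t / (t + h)) for a random walk with Var(y_0) = 0."""
    return math.sqrt(t / (t + h))


h = 100
corr_at_1000 = random_walk_corr(1000, h)

# Smallest integer t for which the correlation at horizon h exceeds .95.
smallest_t = 1
while random_walk_corr(smallest_t, h) <= 0.95:
    smallest_t += 1

print(f"Corr at t = 1000: {corr_at_1000:.4f}; smallest t with Corr > .95: {smallest_t}")
```

The search shows that any $t$ above roughly 925 is already enough, so $t > 1{,}000$ is a comfortable sufficient condition rather than the exact threshold.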

Figure 11.1 plots two realizations of a random walk, generated from a computer, with initial 
value $y_0 = 0$ and $e_t \sim \mathrm{Normal}(0, 1)$. Generally, it is not easy to look at a time series plot and determine 
whether it is a random walk. Next, we will discuss an informal method for making the distinction 
between weakly and highly dependent sequences; we will study formal statistical tests in Chapter 18. 

A series that is generally thought to be well characterized by a random walk is the three-month 
T-bill rate. Annual data are plotted in Figure 11.2 for the years 1948 through 1996. 

A random walk is a special case of what is known as a unit root process. The name comes from 
the fact that $\rho_1 = 1$ in the AR(1) model. A more general class of unit root processes is generated as 
in (11.20), but $\{e_t\}$ is now allowed to be a general, weakly dependent series. [For example, $\{e_t\}$ could 
itself follow an MA(1) or a stable AR(1) process.] When $\{e_t\}$ is not an i.i.d. sequence, the properties 
of the random walk we derived earlier no longer hold. But the key feature of $\{y_t\}$ is preserved: the 
value of $y$ today is highly correlated with $y$ even in the distant future. 

From a policy perspective, it is often important to know whether an economic time series is 
highly persistent or not. Consider the case of gross domestic product in the United States. If GDP 
is asymptotically uncorrelated, then the level of GDP in the coming year is at best weakly related 
to what GDP was, say, 30 years ago. This means a policy that affected GDP long ago has very little 
lasting impact. On the other hand, if GDP is strongly dependent, then next year’s GDP can be highly 
correlated with the GDP from many years ago. Then, we should recognize that a policy that causes a 
discrete change in GDP can have long-lasting effects. 
FIGURE 11.1 Two realizations of the random walk $y_t = y_{t-1} + e_t$, with $y_0 = 0$, 
$e_t \sim \mathrm{Normal}(0, 1)$, and $n = 50$. 


FIGURE 11.2 The U.S. three-month T-bill rate, for the years 1948-1996. 
It is extremely important not to confuse trending and highly persistent behaviors. A series can 
be trending but not highly persistent, as we saw in Chapter 10. Further, factors such as interest rates, 
inflation rates, and unemployment rates are thought by many to be highly persistent, but they have no 
obvious upward or downward trend. However, it is often the case that a highly persistent series also 
contains a clear trend. One model that leads to this behavior is the random walk with drift: 


$$y_t = \alpha_0 + y_{t-1} + e_t, \quad t = 1, 2, \ldots,$$ [11.23] 


where $\{e_t\colon t = 1, 2, \ldots\}$ and $y_0$ satisfy the same properties as in the random walk model. What is new 
is the parameter $\alpha_0$, which is called the drift term. Essentially, to generate $y_t$, the constant $\alpha_0$ is added 
along with the random noise $e_t$ to the previous value $y_{t-1}$. We can show that the expected value of $y_t$ 
follows a linear time trend by using repeated substitution: 


$$y_t = \alpha_0 t + e_t + e_{t-1} + \cdots + e_1 + y_0.$$ 


Therefore, if $y_0 = 0$, $E(y_t) = \alpha_0 t$: the expected value of $y_t$ is growing over time if $\alpha_0 > 0$ and shrinking 
over time if $\alpha_0 < 0$. By reasoning as we did in the pure random walk case, we can show that 
$E(y_{t+h} \mid y_t) = \alpha_0 h + y_t$, and so the best prediction of $y_{t+h}$ at time $t$ is $y_t$ plus the drift $\alpha_0 h$. The variance 
of $y_t$ is the same as it was in the pure random walk case. 

Figure 11.3 contains a realization of a random walk with drift, where $n = 50$, $y_0 = 0$, $\alpha_0 = 2$, 
and the $e_t$ are Normal(0, 9) random variables. As can be seen from this graph, $y_t$ tends to grow over 
time, but the series does not regularly return to the trend line. 
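
The design used for Figure 11.3 can be replicated in a short simulation. The sketch below uses the same settings ($n = 50$, $y_0 = 0$, $\alpha_0 = 2$, innovations with variance 9, hence standard deviation 3) and checks the two moments derived above, $E(y_t) = \alpha_0 t$ and $\mathrm{Var}(y_t) = \sigma_e^2 t$, at $t = 50$. The function name, the seed, and the number of replications are hypothetical choices; the simulated paths are not the ones plotted in the figure.

```python
import random


def simulate_rw_with_drift(n, alpha0, sigma, rng, y0=0.0):
    """Return [y_1, ..., y_n] for y_t = alpha0 + y_{t-1} + e_t, e_t ~ Normal(0, sigma^2)."""
    y, path = y0, []
    for _ in range(n):
        y = alpha0 + y + rng.gauss(0.0, sigma)
        path.append(y)
    return path


rng = random.Random(11)            # fixed seed for reproducibility
n, alpha0, sigma = 50, 2.0, 3.0    # Normal(0, 9) means variance 9, so sd = 3

# Keep only the final value y_50 from each of 4000 simulated paths.
finals = [simulate_rw_with_drift(n, alpha0, sigma, rng)[-1] for _ in range(4000)]
mean_final = sum(finals) / len(finals)
var_final = sum((v - mean_final) ** 2 for v in finals) / (len(finals) - 1)
print(f"mean of y_50 ~ {mean_final:.1f} (theory 100), variance ~ {var_final:.0f} (theory 450)")
```

The large variance at $t = 50$ is what allows any single realization to wander far from the dashed trend line without returning to it.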


FIGURE 11.3 A realization of the random walk with drift, $y_t = 2 + y_{t-1} + e_t$, with $y_0 = 0$, 
$e_t \sim \mathrm{Normal}(0, 9)$, and $n = 50$. The dashed line is the expected value of $y_t$, 
$E(y_t) = 2t$. 


A random walk with drift is another example of a unit root process, because it is the special case 
$\rho_1 = 1$ in an AR(1) model with an intercept: 


$$y_t = \alpha_0 + \rho_1 y_{t-1} + e_t.$$ 


When $\rho_1 = 1$ and $\{e_t\}$ is any weakly dependent process, we obtain a whole class of highly persistent 
time series processes that also have linearly trending means. 


11-3b Transformations on Highly Persistent Time Series 


Using time series with strong persistence of the type displayed by a unit root process in a regression 
equation can lead to very misleading results if the CLM assumptions are violated. We will study the 
spurious regression problem in more detail in Chapter 18, but for now we must be aware of potential 
problems. Fortunately, simple transformations are available that render a unit root process weakly 
dependent. 

Weakly dependent processes are said to be integrated of order zero, or I(0). Practically, this 
means that nothing needs to be done to such series before using them in regression analysis: averages 
of such sequences already satisfy the standard limit theorems. Unit root processes, such as a random 
walk (with or without drift), are said to be integrated of order one, or I(1). This means that the first 
difference of the process is weakly dependent (and often stationary). A time series that is I(1) is 
often said to be a difference-stationary process, although the name is somewhat misleading with its 
emphasis on stationarity after differencing rather than weak dependence in the differences. 

The concept of an I(1) process is easiest to see for a random walk. With $\{y_t\}$ generated as in 
(11.20) for $t = 1, 2, \ldots$, 


$$\Delta y_t = y_t - y_{t-1} = e_t, \quad t = 2, 3, \ldots;$$ [11.24] 


therefore, the first-differenced series $\{\Delta y_t\colon t = 2, 3, \ldots\}$ is actually an i.i.d. sequence. More generally, 
if $\{y_t\}$ is generated by (11.24) where $\{e_t\}$ is any weakly dependent process, then $\{\Delta y_t\}$ is weakly 
dependent. Thus, when we suspect processes are integrated of order one, we often first difference in order to use 
them in regression analysis; we will see some examples later. (Incidentally, the symbol “$\Delta$” can mean 
“change” as well as “difference.” In actual data sets, if an original variable is named y then its change or 
difference is often denoted cy or dy. For example, the change in price might be denoted cprice.) 
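
Equation (11.24) can be verified numerically: first differencing a simulated random walk returns the innovations that generated it. The sketch below is a minimal check under the assumption of standard normal innovations and $y_0 = 0$; the variable names are hypothetical.

```python
import random

rng = random.Random(1)  # fixed seed for reproducibility
shocks = [rng.gauss(0.0, 1.0) for _ in range(200)]   # e_1, ..., e_200

# Build the random walk y_t = y_{t-1} + e_t with y_0 = 0.
y, level = [], 0.0
for e in shocks:
    level += e
    y.append(level)

# First difference: dy[i] corresponds to Delta y_t for t = 2, 3, ..., 200.
dy = [y[i] - y[i - 1] for i in range(1, len(y))]

# Up to floating point rounding, Delta y_t equals e_t for t >= 2.
max_gap = max(abs(d - e) for d, e in zip(dy, shocks[1:]))
print(f"observations after differencing: {len(dy)}, largest |dy_t - e_t| = {max_gap:.2e}")
```

One observation is lost in differencing, which is why the differenced regressions reported later have one fewer observation than their counterparts in levels.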

Many time series $y_t$ that are strictly positive are such that $\log(y_t)$ is integrated of order one. In this 
case, we can use the first difference in the logs, $\Delta\log(y_t) = \log(y_t) - \log(y_{t-1})$, in regression 
analysis. Alternatively, because 


$$\Delta\log(y_t) \approx (y_t - y_{t-1})/y_{t-1},$$ [11.25] 


we can use the proportionate or percentage change in $y_t$ directly; this is what we did in Example 11.4 
where, rather than stating the EMH in terms of the stock price, $p_t$, we used the weekly percentage change, 
$\mathit{return}_t = 100[(p_t - p_{t-1})/p_{t-1}]$. The quantity in equation (11.25) is often called the growth rate, 
measured as a proportionate change. When using a particular data set, it is important to know how the growth 
rates are measured, whether as a proportionate or a percentage change. Sometimes if an original variable 
is y its growth rate is denoted gy, so that for each $t$, $gy_t = \log(y_t) - \log(y_{t-1})$ or $gy_t = (y_t - y_{t-1})/y_{t-1}$. 
Often these quantities are multiplied by 100 to turn a proportionate change into a percentage change. 
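
The quality of the approximation in (11.25) depends on the size of the change. The sketch below compares the log difference with the proportionate change for a small and a large movement in $y$; the numerical values (100 to 102, and 100 to 150) are hypothetical and chosen only to show the pattern.

```python
import math


def log_growth(y_prev, y_cur):
    """Growth rate measured as the change in logs."""
    return math.log(y_cur) - math.log(y_prev)


def proportionate_growth(y_prev, y_cur):
    """Growth rate measured as a proportionate change."""
    return (y_cur - y_prev) / y_prev


small = (log_growth(100.0, 102.0), proportionate_growth(100.0, 102.0))
large = (log_growth(100.0, 150.0), proportionate_growth(100.0, 150.0))
print(f"small change: log diff = {small[0]:.4f}, proportionate = {small[1]:.4f}")
print(f"large change: log diff = {large[0]:.4f}, proportionate = {large[1]:.4f}")
```

For changes of a few percent the two measures nearly coincide, while for large changes they differ noticeably, which is one reason to check how growth rates are defined in a given data set.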

Differencing time series before using them in regression analysis has another benefit: it removes 
any linear time trend. This is easily seen by writing a linearly trending variable as 


$$y_t = \gamma_0 + \gamma_1 t + v_t,$$ 


where $v_t$ has a zero mean. Then, $\Delta y_t = \gamma_1 + \Delta v_t$, and so $E(\Delta y_t) = \gamma_1 + E(\Delta v_t) = \gamma_1$. In other words, 
$E(\Delta y_t)$ is constant. The same argument works for $\Delta\log(y_t)$ when $\log(y_t)$ follows a linear time trend. 
Therefore, rather than including a time trend in a regression, we can instead difference those variables 
that show obvious trends. 
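
The detrending effect of differencing can be seen in a simulated trending series. The sketch below assumes $y_t = \gamma_0 + \gamma_1 t + v_t$ with i.i.d. standard normal $v_t$; the parameter values $\gamma_0 = 10$ and $\gamma_1 = .5$ are hypothetical. It compares the average first difference in the first and second halves of the sample, which should both be close to $\gamma_1$.

```python
import random

rng = random.Random(5)             # fixed seed for reproducibility
gamma0, gamma1, n = 10.0, 0.5, 500

# Linearly trending series with zero-mean noise around the trend.
y = [gamma0 + gamma1 * t + rng.gauss(0.0, 1.0) for t in range(1, n + 1)]
dy = [y[i] - y[i - 1] for i in range(1, n)]   # first differences

mean_dy = sum(dy) / len(dy)
half = len(dy) // 2
mean_first_half = sum(dy[:half]) / half
mean_second_half = sum(dy[half:]) / (len(dy) - half)
print(f"mean of differences: {mean_dy:.3f} "
      f"(first half {mean_first_half:.3f}, second half {mean_second_half:.3f})")
```

The level of $y_t$ rises steadily over the sample, but the mean of $\Delta y_t$ does not drift, in line with $E(\Delta y_t) = \gamma_1$.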
11-3c Deciding Whether a Time Series Is I(1) 


Determining whether a particular time series realization is the outcome of an I(1) versus an I(0) pro- 
cess can be quite difficult. Statistical tests can be used for this purpose, but these are more advanced; 
we provide an introductory treatment in Chapter 18. 

There are informal methods that provide useful guidance about whether a time series process 
is roughly characterized by weak dependence. A very simple tool is motivated by the AR(1) model: 
if $|\rho_1| < 1$, then the process is I(0), but it is I(1) if $\rho_1 = 1$. Earlier, we showed that, when the AR(1) 
process is stable, $\rho_1 = \mathrm{Corr}(y_t, y_{t-1})$. Therefore, we can estimate $\rho_1$ from the sample correlation 
between $y_t$ and $y_{t-1}$. This sample correlation coefficient is called the first order autocorrelation of 
$\{y_t\}$; we denote this by $\hat{\rho}_1$. By applying the LLN, $\hat{\rho}_1$ can be shown to be consistent for $\rho_1$ provided 
$|\rho_1| < 1$. (However, $\hat{\rho}_1$ is not an unbiased estimator of $\rho_1$.) 

We can use the value of $\hat{\rho}_1$ to help decide whether the process is I(1) or I(0). Unfortunately, 
because $\hat{\rho}_1$ is an estimate, we can never know for sure whether $\rho_1 < 1$. Ideally, we could compute a 
confidence interval for $\rho_1$ to see if it excludes the value $\rho_1 = 1$, but this turns out to be rather difficult: 
the sampling distributions of the estimator $\hat{\rho}_1$ are extremely different when $\rho_1$ is close to one and 
when $\rho_1$ is much less than one. (In fact, when $\rho_1$ is close to one, $\hat{\rho}_1$ can have a severe downward bias.) 

In Chapter 18, we will show how to test $H_0\colon \rho_1 = 1$ against $H_1\colon \rho_1 < 1$. For now, we can only use 
$\hat{\rho}_1$ as a rough guide for determining whether a series needs to be differenced. No hard-and-fast rule 
exists for making this choice. Most economists think that differencing is warranted if $\hat{\rho}_1 > .9$; some 
would difference when $\hat{\rho}_1 > .8$. 
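
The first order autocorrelation is straightforward to compute. The sketch below calculates $\hat{\rho}_1$ as the sample correlation between $y_t$ and $y_{t-1}$ for two simulated series, a stable AR(1) with $\rho_1 = .5$ and a random walk. The function names, sample size, and parameter values are hypothetical, and the simulated series merely stand in for actual economic data.

```python
import math
import random


def first_order_autocorr(y):
    """Sample correlation between y_t and y_{t-1}."""
    current, lagged = y[1:], y[:-1]
    n = len(current)
    mean_cur, mean_lag = sum(current) / n, sum(lagged) / n
    cov = sum((c - mean_cur) * (l - mean_lag) for c, l in zip(current, lagged))
    var_cur = sum((c - mean_cur) ** 2 for c in current)
    var_lag = sum((l - mean_lag) ** 2 for l in lagged)
    return cov / math.sqrt(var_cur * var_lag)


def simulate_ar1(n, rho1, rng):
    """y_t = rho1 * y_{t-1} + e_t with y_0 = 0 and e_t ~ Normal(0, 1)."""
    y, path = 0.0, []
    for _ in range(n):
        y = rho1 * y + rng.gauss(0.0, 1.0)
        path.append(y)
    return path


rng = random.Random(7)  # fixed seed for reproducibility
rho_hat_stable = first_order_autocorr(simulate_ar1(500, 0.5, rng))
rho_hat_rw = first_order_autocorr(simulate_ar1(500, 1.0, rng))
print(f"stable AR(1): rho_hat = {rho_hat_stable:.3f}; random walk: rho_hat = {rho_hat_rw:.3f}")
```

Under the rough rule described above, the random walk series would be flagged for differencing, while the stable AR(1) series would not.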


EXAMPLE 11.6 Fertility Equation 


In Example 10.4, we explained the general fertility rate, $\mathit{gfr}$, in terms of the value of the personal 
exemption, $\mathit{pe}$. The first order autocorrelations for these series are very large: $\hat{\rho}_1 = .977$ for $\mathit{gfr}$ and 
$\hat{\rho}_1 = .964$ for $\mathit{pe}$. These autocorrelations are highly suggestive of unit root behavior, and they raise 
serious questions about our use of the usual OLS $t$ statistics for this example back in Chapter 10. 
Remember, the $t$ statistics only have exact $t$ distributions under the full set of classical linear model 
assumptions. To relax those assumptions in any way and apply asymptotics, we generally need the 
underlying series to be I(0) processes. 
We now estimate the equation using first differences (and drop the dummy variable, for 
simplicity): 

$$\begin{aligned}
\widehat{\Delta \mathit{gfr}} &= \underset{(.502)}{-.785} - \underset{(.028)}{.043}\,\Delta \mathit{pe} \\
n &= 71,\quad R^2 = .032,\quad \bar{R}^2 = .018.
\end{aligned}$$ [11.26] 

Now, an increase in $\mathit{pe}$ is estimated to lower $\mathit{gfr}$ contemporaneously, although the estimate is not 
statistically different from zero at the 5% level. This gives very different results than when we estimated 
the model in levels, and it casts doubt on our earlier analysis. 

If we add two lags of $\Delta \mathit{pe}$, things improve: 

$$\begin{aligned}
\widehat{\Delta \mathit{gfr}} &= \underset{(.468)}{-.964} - \underset{(.027)}{.036}\,\Delta \mathit{pe} - \underset{(.028)}{.014}\,\Delta \mathit{pe}_{-1} + \underset{(.027)}{.110}\,\Delta \mathit{pe}_{-2} \\
n &= 69,\quad R^2 = .233,\quad \bar{R}^2 = .197.
\end{aligned}$$ [11.27] 

Even though $\Delta \mathit{pe}$ and $\Delta \mathit{pe}_{-1}$ have negative coefficients, their coefficients are small and jointly 
insignificant ($p$-value = .28). The second lag is very significant and indicates a positive relationship 
between changes in $\mathit{pe}$ and subsequent changes in $\mathit{gfr}$ two years hence. This makes more sense than 
having a contemporaneous effect. See Computer Exercise C5 for further analysis of the equation in 
first differences. 
When the series in question has an obvious upward or downward trend, it makes more sense to 
obtain the first order autocorrelation after detrending. If the data are not detrended, the autoregressive 
correlation tends to be overestimated, which biases toward finding a unit root in a trending process. 


EXAMPLE 11.7 Wages and Productivity 


The variable $\mathit{hrwage}$ is average hourly wage in the U.S. economy, and $\mathit{outphr}$ is output per hour. 
One way to estimate the elasticity of hourly wage with respect to output per hour is to estimate the 
equation, 


$$\log(\mathit{hrwage}_t) = \beta_0 + \beta_1 \log(\mathit{outphr}_t) + \beta_2 t + u_t,$$ 


where the time trend is included because $\log(\mathit{hrwage}_t)$ and $\log(\mathit{outphr}_t)$ both display clear, upward, 
linear trends. Using the data in EARNS for the years 1947 through 1987, we obtain 

$$\begin{aligned}
\widehat{\log(\mathit{hrwage}_t)} &= \underset{(.37)}{-5.33} + \underset{(.09)}{1.64}\,\log(\mathit{outphr}_t) - \underset{(.002)}{.018}\,t \\
n &= 41,\quad R^2 = .971,\quad \bar{R}^2 = .970.
\end{aligned}$$ [11.28] 


(We have reported the usual goodness-of-fit measures here; it would be better to report those based 
on the detrended dependent variable, as in Section 10-5.) The estimated elasticity seems too large: 
a 1% increase in productivity increases real wages by about 1.64%. Because the standard error is 
so small, the 95% confidence interval easily excludes a unit elasticity. U.S. workers would prob- 
ably have trouble believing that their wages increase by more than 1.5% for every 1% increase in 
productivity. 

The regression results in (11.28) must be viewed with caution. Even after linearly detrending 
$\log(\mathit{hrwage})$, the first order autocorrelation is .967, and for detrended $\log(\mathit{outphr})$, $\hat{\rho}_1 = .945$. These 
suggest that both series have unit roots, so we reestimate the equation in first differences (and we no 
longer need a time trend): 


$$\begin{aligned}
\widehat{\Delta\log(\mathit{hrwage}_t)} &= \underset{(.0042)}{-.0036} + \underset{(.173)}{.809}\,\Delta\log(\mathit{outphr}_t) \\
n &= 40,\quad R^2 = .364,\quad \bar{R}^2 = .348.
\end{aligned}$$ [11.29] 



Now, a 1% increase in productivity is estimated to increase real wages by about .81%, and the esti- 
mate is not statistically different from one. The adjusted R-squared shows that the growth in output 
explains about 35% of the growth in real wages. See Computer Exercise C2 for a simple distributed 
lag version of the model in first differences. 


In the previous two examples, both the dependent and independent variables appear to have unit 
roots. In other cases, we might have a mixture of processes with unit roots and those that are weakly 
dependent (though possibly trending). An example is given in Computer Exercise C1. 


11-4 Dynamically Complete Models and the Absence of Serial Correlation 


In the AR(1) model in (11.12), we showed that, under assumption (11.13), the errors $\{u_t\}$ must 
be serially uncorrelated in the sense that Assumption TS.5’ is satisfied: assuming that no serial 
correlation exists is practically the same thing as assuming that only one lag of $y$ appears in 
$E(y_t \mid y_{t-1}, y_{t-2}, \ldots)$. 

Can we make a similar statement for other regression models? The answer is yes, although the 
assumptions required for the errors to be serially uncorrelated might be implausible. Consider, for 
example, the simple static regression model 


$$y_t = \beta_0 + \beta_1 z_t + u_t,$$ [11.30] 


where $y_t$ and $z_t$ are contemporaneously dated. For consistency of OLS, we only need $E(u_t \mid z_t) = 0$. 
Generally, the $\{u_t\}$ will be serially correlated. However, if we assume that 


$$E(u_t \mid z_t, y_{t-1}, z_{t-1}, \ldots) = 0,$$ [11.31] 


then (as we will show generally later) Assumption TS.5’ holds. In particular, the $\{u_t\}$ are serially 
uncorrelated. Naturally, assumption (11.31) implies that $z_t$ is contemporaneously exogenous, that is, 
$E(u_t \mid z_t) = 0$. 

To gain insight into the meaning of (11.31), we can write (11.30) and (11.31) equivalently as 


$$E(y_t \mid z_t, y_{t-1}, z_{t-1}, \ldots) = E(y_t \mid z_t) = \beta_0 + \beta_1 z_t,$$ [11.32] 


where the first equality is the one of current interest. It says that, once $z_t$ has been controlled for, no 
lags of either $y$ or $z$ help to explain current $y$. This is a strong requirement and is implausible when the 
lagged dependent variable has predictive power, which is often the case. If the first equality in (11.32) 
does not hold, we can expect the errors to be serially correlated. 
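
A simulated example illustrates how a static model can satisfy contemporaneous exogeneity and still fail (11.31). In the sketch below, $z_t$ is i.i.d. and the error follows a stable AR(1) process, so $E(u_t \mid z_t) = 0$ holds but lagged $y$ and $z$ help predict $u_t$ through $u_{t-1}$. All parameter values (intercept 1, slope .5, error autocorrelation .6), the sample size, and the helper name `ols_simple` are hypothetical choices for illustration.

```python
import random


def ols_simple(x, y):
    """Intercept and slope from the simple regression of y on x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
             / sum((a - mean_x) ** 2 for a in x))
    return mean_y - slope * mean_x, slope


rng = random.Random(3)  # fixed seed for reproducibility
n = 2000
z = [rng.gauss(0.0, 1.0) for _ in range(n)]

# Errors follow u_t = 0.6 * u_{t-1} + e_t, so the static model is not dynamically complete.
u, prev = [], 0.0
for _ in range(n):
    prev = 0.6 * prev + rng.gauss(0.0, 1.0)
    u.append(prev)

y = [1.0 + 0.5 * z_t + u_t for z_t, u_t in zip(z, u)]

b0_hat, b1_hat = ols_simple(z, y)                      # static regression of y on z
resid = [y_t - b0_hat - b1_hat * z_t for z_t, y_t in zip(z, y)]
_, rho_resid = ols_simple(resid[:-1], resid[1:])       # residual on lagged residual
print(f"slope estimate = {b1_hat:.3f}, residual AR(1) coefficient = {rho_resid:.3f}")
```

The slope is still estimated well, consistent with the claim that only $E(u_t \mid z_t) = 0$ is needed for consistency, but the residuals show clear first order serial correlation.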

Next, consider a finite distributed lag model with two lags: 


$$y_t = \beta_0 + \beta_1 z_t + \beta_2 z_{t-1} + \beta_3 z_{t-2} + u_t.$$ [11.33] 


Because we are hoping to capture the lagged effects that $z$ has on $y$, we would naturally assume that 
(11.33) captures the distributed lag dynamics: 


$$E(y_t \mid z_t, z_{t-1}, z_{t-2}, z_{t-3}, \ldots) = E(y_t \mid z_t, z_{t-1}, z_{t-2});$$ [11.34] 


that is, at most two lags of $z$ matter. If (11.31) holds, we can make a stronger statement: once we have 
controlled for $z$ and its two lags, no lags of $y$ or additional lags of $z$ affect current $y$: 


$$E(y_t \mid z_t, y_{t-1}, z_{t-1}, \ldots) = E(y_t \mid z_t, z_{t-1}, z_{t-2}).$$ [11.35] 


Equation (11.35) is more likely than (11.32), but it still rules out lagged $y$ having extra predictive 
power for current $y$. 
Next, consider a model with one lag of both y and z: 


$$y_t = \beta_0 + \beta_1 z_t + \beta_2 y_{t-1} + \beta_3 z_{t-1} + u_t.$$ 


Because this model includes a lagged dependent variable, (11.31) is a natural assumption, as it implies 
that 


$$E(y_t \mid z_t, y_{t-1}, z_{t-1}, y_{t-2}, \ldots) = E(y_t \mid z_t, y_{t-1}, z_{t-1});$$ 


in other words, once $z_t$, $y_{t-1}$, and $z_{t-1}$ have been controlled for, no further lags of either $y$ or $z$ affect 
current $y$. 
In the general model 


$$y_t = \beta_0 + \beta_1 x_{t1} + \cdots + \beta_k x_{tk} + u_t,$$ [11.36] 


where the explanatory variables $\mathbf{x}_t = (x_{t1}, \ldots, x_{tk})$ may or may not contain lags of $y$ or $z$, (11.31) 
becomes 


$$E(u_t \mid \mathbf{x}_t, y_{t-1}, \mathbf{x}_{t-1}, \ldots) = 0.$$ [11.37] 

Written in terms of $y_t$, 

$$E(y_t \mid \mathbf{x}_t, y_{t-1}, \mathbf{x}_{t-1}, \ldots) = E(y_t \mid \mathbf{x}_t).$$ [11.38] 


In other words, whatever is in $\mathbf{x}_t$, enough lags have been included so that further lags of $y$ and 
the explanatory variables do not matter for explaining $y_t$. When this condition holds, we have a 
dynamically complete model. As we saw earlier, dynamic completeness can be a very strong 
assumption for static and finite distributed lag models. 

After we start putting lagged y as explanatory variables, we often think that the model should be 
dynamically complete. We will touch on some exceptions to this claim in Chapter 18. 

Because (11.37) is equivalent to 


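$$E(u_t \mid \mathbf{x}_t, u_{t-1}, \mathbf{x}_{t-1}, u_{t-2}, \ldots) = 0,$$ [11.39] 


we can show that a dynamically complete model must satisfy Assumption TS.5’. (This derivation is 
not crucial and can be skipped without loss of continuity.) For concreteness, take $s < t$. Then, by the 
law of iterated expectations (see Math Refresher B), 


$$\begin{aligned}
E(u_t u_s \mid \mathbf{x}_t, \mathbf{x}_s) &= E[E(u_t u_s \mid \mathbf{x}_t, \mathbf{x}_s, u_s) \mid \mathbf{x}_t, \mathbf{x}_s] \\
&= E[u_s E(u_t \mid \mathbf{x}_t, \mathbf{x}_s, u_s) \mid \mathbf{x}_t, \mathbf{x}_s],
\end{aligned}$$ 


where the second equality follows from $E(u_t u_s \mid \mathbf{x}_t, \mathbf{x}_s, u_s) = u_s E(u_t \mid \mathbf{x}_t, \mathbf{x}_s, u_s)$. Now, because $s < t$, 
$(\mathbf{x}_t, \mathbf{x}_s, u_s)$ is a subset of the conditioning set in (11.39). Therefore, (11.39) implies that 
$E(u_t \mid \mathbf{x}_t, \mathbf{x}_s, u_s) = 0$, and so 


$$E(u_t u_s \mid \mathbf{x}_t, \mathbf{x}_s) = E(u_s \cdot 0 \mid \mathbf{x}_t, \mathbf{x}_s) = 0,$$ 


which says that Assumption TS.5’ holds. 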


Because specifying a dynamically complete model means that there is no serial correlation, does 
it follow that all models should be dynamically complete? As we will see in Chapter 18, for forecasting 
purposes, the answer is yes. Some think that all models should be dynamically complete and that serial 
correlation in the errors of a model is a sign of misspecification. This stance is too rigid. Sometimes, 
we really are interested in a static model (such as a Phillips curve) or a finite distributed lag model 
(such as measuring the long-run percentage change in wages given a 1% increase in productivity). 
In Chapter 12, we will show how to detect and correct for serial correlation in such models. 


GOING FURTHER 11.3 

If (11.33) holds where $u_t = e_t + \alpha_1 e_{t-1}$ and where $\{e_t\}$ is an i.i.d. sequence with mean 
zero and variance $\sigma_e^2$, can equation (11.33) be dynamically complete? 


EXAMPLE 11.8 Fertility Equation 


In equation (11.27), we estimated a distributed lag model for $\Delta \mathit{gfr}$ on $\Delta \mathit{pe}$, allowing for two lags of 
$\Delta \mathit{pe}$. For this model to be dynamically complete in the sense of (11.38), neither lags of $\Delta \mathit{gfr}$ nor 
further lags of $\Delta \mathit{pe}$ should appear in the equation. We can easily see that this is false by adding $\Delta \mathit{gfr}_{-1}$: 
the coefficient estimate is .300, and its $t$ statistic is 2.84. Thus, the model is not dynamically complete 
in the sense of (11.38). 

What should we make of this? We will postpone an interpretation of general models with 
lagged dependent variables until Chapter 18. But the fact that (11.27) is not dynamically complete 
suggests that there may be serial correlation in the errors. We will see how to test and correct for this 
in Chapter 12. 


The notion of dynamic completeness should not be confused with a weaker assumption concerning 
including the appropriate lags in a model. In the model (11.36), the explanatory variables $\mathbf{x}_t$ are 
said to be sequentially exogenous if 


$$E(u_t \mid \mathbf{x}_t, \mathbf{x}_{t-1}, \ldots) = E(u_t) = 0, \quad t = 1, 2, \ldots.$$ [11.40] 


As discussed in Problem 8 in Chapter 10, sequential exogeneity is implied by strict exogeneity and 
sequential exogeneity implies contemporaneous exogeneity. Further, because $(\mathbf{x}_t, \mathbf{x}_{t-1}, \ldots)$ is a subset 
of $(\mathbf{x}_t, y_{t-1}, \mathbf{x}_{t-1}, \ldots)$, sequential exogeneity is implied by dynamic completeness. If $\mathbf{x}_t$ contains $y_{t-1}$, 
then dynamic completeness and sequential exogeneity are the same condition. The key point is that, when 
$\mathbf{x}_t$ does not contain $y_{t-1}$, sequential exogeneity allows for the possibility that the dynamics are not 
complete in the sense of capturing the relationship between $y_t$ and all past values of $y$ and other explanatory 
variables. But in finite distributed lag models, such as that estimated in equation (11.27), we may not 
care whether past $y$ has predictive power for current $y$. We are primarily interested in whether we have 
included enough lags of the explanatory variables to capture the distributed lag dynamics. For example, 
if we assume $E(y_t \mid z_t, z_{t-1}, z_{t-2}, z_{t-3}, \ldots) = E(y_t \mid z_t, z_{t-1}, z_{t-2}) = \alpha_0 + \delta_0 z_t + \delta_1 z_{t-1} + \delta_2 z_{t-2}$, 
then the regressors $\mathbf{x}_t = (z_t, z_{t-1}, z_{t-2})$ are sequentially exogenous because we have assumed that 
two lags suffice for the distributed lag dynamics. But typically the model would not be dynamically 
complete in the sense that $E(y_t \mid z_t, y_{t-1}, z_{t-1}, y_{t-2}, z_{t-2}, \ldots) = E(y_t \mid z_t, z_{t-1}, z_{t-2})$, and we may not care. 
In addition, the explanatory variables in an FDL model may or may not be strictly exogenous. 


11-5 The Homoskedasticity Assumption for Time 
Series Models 


The homoskedasticity assumption for time series regressions, particularly TS.4’, looks very similar
to that for cross-sectional regressions. However, because $\mathbf{x}_t$ can contain lagged $y$ as well as lagged
explanatory variables, we briefly discuss the meaning of the homoskedasticity assumption for different
time series regressions.

In the simple static model, say,


$y_t = \beta_0 + \beta_1 z_t + u_t,$ [11.41]


Assumption TS.4’ requires that


$\mathrm{Var}(u_t \mid z_t) = \sigma^2.$


Therefore, even though $E(y_t \mid z_t)$ is a linear function of $z_t$, $\mathrm{Var}(y_t \mid z_t)$ must be constant. This is pretty
straightforward.

In Example 11.4, we saw that, for the AR(1) model in (11.12), the homoskedasticity
assumption is


$\mathrm{Var}(u_t \mid y_{t-1}) = \mathrm{Var}(y_t \mid y_{t-1}) = \sigma^2;$


even though $E(y_t \mid y_{t-1})$ depends on $y_{t-1}$, $\mathrm{Var}(y_t \mid y_{t-1})$ does not. Thus, the spread in the distribution of
$y_t$ cannot depend on $y_{t-1}$.

Hopefully, the pattern is clear now. If we have the model


$y_t = \beta_0 + \beta_1 z_t + \beta_2 y_{t-1} + \beta_3 z_{t-1} + u_t,$


the homoskedasticity assumption is


$\mathrm{Var}(u_t \mid z_t, y_{t-1}, z_{t-1}) = \mathrm{Var}(y_t \mid z_t, y_{t-1}, z_{t-1}) = \sigma^2,$


so that the variance of $u_t$ cannot depend on $z_t$, $y_{t-1}$, or $z_{t-1}$ (or some other function of time). Generally,
whatever explanatory variables appear in the model, we must assume that the variance of $y_t$, given
these explanatory variables, is constant. If the model contains lagged $y$ or lagged explanatory variables,
then we are explicitly ruling out dynamic forms of heteroskedasticity (something we study in
Chapter 12). But, in a static model, we are only concerned with $\mathrm{Var}(y_t \mid z_t)$. In equation (11.41), no
direct restrictions are placed on, say, $\mathrm{Var}(y_t \mid y_{t-1})$.


Summary 


In this chapter, we have argued that OLS can be justified using asymptotic analysis, provided certain condi- 
tions are met. Ideally, the time series processes are stationary and weakly dependent, although stationarity 
is not crucial. Weak dependence is necessary for applying the standard large sample results, particularly the 
central limit theorem. 

Processes with deterministic trends that are weakly dependent can be used directly in regression anal- 
ysis, provided time trends are included in the model (as in Section 10-5). A similar statement holds for 
processes with seasonality. 

When the time series are highly persistent (they have unit roots), we must exercise extreme caution in 
using them directly in regression models (unless we are convinced the CLM assumptions from Chapter 10 
hold). An alternative to using the levels is to use the first differences of the variables. For most highly 
persistent economic time series, the first difference is weakly dependent. Using first differences changes 
the nature of the model, but this method is often as informative as a model in levels. When data are highly 
persistent, we usually have more faith in first-difference results. In Chapter 18, we will cover some recent, 
more advanced, methods for using I(1) variables in multiple regression analysis. 
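
The contrast between a highly persistent series and its first difference can be seen in a short simulation. The sketch below is illustrative only: the seed, the sample size, and the helper name `first_order_autocorr` are assumptions made here. It generates a random walk and compares the first order autocorrelation of the level with that of the first difference.

```python
import numpy as np

def first_order_autocorr(x):
    """Sample correlation between x_t and x_{t-1}."""
    return np.corrcoef(x[1:], x[:-1])[0, 1]

rng = np.random.default_rng(1)
e = rng.normal(size=2000)   # i.i.d. shocks
y = np.cumsum(e)            # random walk: y_t = y_{t-1} + e_t
dy = np.diff(y)             # first difference, which recovers the shocks

rho_level = first_order_autocorr(y)
rho_diff = first_order_autocorr(dy)
print(f"levels: {rho_level:.3f}, first differences: {rho_diff:.3f}")
```

The level shows an autocorrelation close to one, while the differenced series behaves like a weakly dependent (here, uncorrelated) sequence.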

When models have complete dynamics in the sense that no further lags of any variable are needed 
in the equation, we have seen that the errors will be serially uncorrelated. This is useful because certain 
models, such as autoregressive models, are assumed to have complete dynamics. In static and distributed 
lag models, the dynamically complete assumption is often false, which generally means the errors will be 
serially correlated. We will see how to address this problem in Chapter 12. 


THE “ASYMPTOTIC” GAUSS-MARKOV ASSUMPTIONS 
FOR TIME SERIES REGRESSION 


Following is a summary of the five assumptions that we used in this chapter to perform large-sample
inference for time series regressions. Recall that we introduced this new set of assumptions because the time
series versions of the classical linear model assumptions are often violated, especially the strict exogeneity,
no serial correlation, and normality assumptions. A key point in this chapter is that some sort of weak
dependence is required to ensure that the central limit theorem applies. We only used Assumptions TS.1’
through TS.3’ for consistency (not unbiasedness) of OLS. When we add TS.4’ and TS.5’, we can use the
usual confidence intervals, t statistics, and F statistics as being approximately valid in large samples.
Unlike the Gauss-Markov and classical linear model assumptions, there is no historically significant name
attached to Assumptions TS.1’ to TS.5’. Nevertheless, the assumptions are the analogs to the Gauss-Markov
assumptions that allow us to use standard inference. As usual for large-sample analysis, we dispense with
the normality assumption entirely.

Assumption TS.1’ (Linearity and Weak Dependence)
The stochastic process $\{(x_{t1}, x_{t2}, \ldots, x_{tk}, y_t): t = 1, 2, \ldots, n\}$ is stationary, weakly dependent, and follows
the linear model


$y_t = \beta_0 + \beta_1 x_{t1} + \beta_2 x_{t2} + \cdots + \beta_k x_{tk} + u_t,$


where $\{u_t: t = 1, 2, \ldots, n\}$ is the sequence of errors or disturbances. Here, $n$ is the number of observations
(time periods).


Assumption TS.2’ (No Perfect Collinearity)
In the sample (and therefore in the underlying time series process), no independent variable is constant nor
a perfect linear combination of the others.


Assumption TS.3’ (Zero Conditional Mean)
The explanatory variables are contemporaneously exogenous, that is, $E(u_t \mid x_{t1}, \ldots, x_{tk}) = 0$. Remember,
TS.3’ is notably weaker than the strict exogeneity Assumption TS.3.


Assumption TS.4’ (Homoskedasticity)
The errors are contemporaneously homoskedastic, that is, $\mathrm{Var}(u_t \mid \mathbf{x}_t) = \sigma^2$, where $\mathbf{x}_t$ is shorthand for
$(x_{t1}, x_{t2}, \ldots, x_{tk})$.


Assumption TS.5’ (No Serial Correlation)
For all $t \neq s$, $E(u_t u_s \mid \mathbf{x}_t, \mathbf{x}_s) = 0$.


Key Terms


Asymptotically Uncorrelated
Autoregressive Process of Order One [AR(1)]
Contemporaneously Exogenous
Contemporaneously Homoskedastic
Covariance Stationary
Difference-Stationary Process
Dynamically Complete Model
First Difference
First Order Autocorrelation
Growth Rate
Highly Persistent
Integrated of Order One [I(1)]
Integrated of Order Zero [I(0)]
Moving Average Process of Order One [MA(1)]
Nonstationary Process
Random Walk
Random Walk with Drift
Sequentially Exogenous
Serially Uncorrelated
Stable AR(1) Process
Stationary Process
Strongly Dependent
Trend-Stationary Process
Unit Root Process
Weakly Dependent


Problems


1 Let $\{x_t: t = 1, 2, \ldots\}$ be a covariance stationary process and define $\gamma_h = \mathrm{Cov}(x_t, x_{t+h})$ for $h \geq 0$.
[Therefore, $\gamma_0 = \mathrm{Var}(x_t)$.] Show that $\mathrm{Corr}(x_t, x_{t+h}) = \gamma_h/\gamma_0$.


2 Let $\{e_t: t = -1, 0, 1, \ldots\}$ be a sequence of independent, identically distributed random variables with
mean zero and variance one. Define a stochastic process by


$x_t = e_t - (1/2)e_{t-1} + (1/2)e_{t-2}, \quad t = 1, 2, \ldots.$

(i) Find $E(x_t)$ and $\mathrm{Var}(x_t)$. Do either of these depend on $t$?
(ii) Show that $\mathrm{Corr}(x_t, x_{t+1}) = -1/2$ and $\mathrm{Corr}(x_t, x_{t+2}) = 1/3$. (Hint: It is easiest to use the formula
in Problem 1.)
(iii) What is $\mathrm{Corr}(x_t, x_{t+h})$ for $h > 2$?
(iv) Is $\{x_t\}$ an asymptotically uncorrelated process?


3 Suppose that a time series process $\{y_t\}$ is generated by $y_t = z + e_t$ for all $t = 1, 2, \ldots$, where $\{e_t\}$ is
an i.i.d. sequence with mean zero and variance $\sigma_e^2$. The random variable $z$ does not change over time; it
has mean zero and variance $\sigma_z^2$. Assume that each $e_t$ is uncorrelated with $z$.

(i) Find the expected value and variance of $y_t$. Do your answers depend on $t$?

(ii) Find $\mathrm{Cov}(y_t, y_{t+h})$ for any $t$ and $h$. Is $\{y_t\}$ covariance stationary?

(iii) Use parts (i) and (ii) to show that $\mathrm{Corr}(y_t, y_{t+h}) = \sigma_z^2/(\sigma_z^2 + \sigma_e^2)$ for all $t$ and $h$.

(iv) Does $y_t$ satisfy the intuitive requirement for being asymptotically uncorrelated? Explain.


4 Let $\{y_t: t = 1, 2, \ldots\}$ follow a random walk, as in (11.20), with $y_0 = 0$. Show that
$\mathrm{Corr}(y_t, y_{t+h}) = \sqrt{t/(t + h)}$ for $t \geq 1$, $h > 0$.


5 For the U.S. economy, let gprice denote the monthly growth in the overall price level and let gwage
be the monthly growth in hourly wages. [These are both obtained as differences of logarithms:
$gprice = \Delta\log(price)$ and $gwage = \Delta\log(wage)$.] Using the monthly data in WAGEPRC, we
estimate the following distributed lag model (standard errors in parentheses below the estimates):


$\widehat{gprice} = -.00093 + .119\,gwage + .097\,gwage_{-1} + .040\,gwage_{-2}$
           (.00057)   (.052)        (.039)             (.039)
$\quad + .038\,gwage_{-3} + .081\,gwage_{-4} + .107\,gwage_{-5} + .095\,gwage_{-6}$
     (.039)             (.039)             (.039)             (.039)
$\quad + .104\,gwage_{-7} + .103\,gwage_{-8} + .159\,gwage_{-9} + .110\,gwage_{-10}$
     (.039)             (.039)             (.039)             (.039)
$\quad + .103\,gwage_{-11} + .016\,gwage_{-12}$
     (.039)              (.052)


$n = 273, \quad R^2 = .317, \quad \bar{R}^2 = .283.$


(i) Sketch the estimated lag distribution. At what lag is the effect of gwage on gprice largest?
Which lag has the smallest coefficient?

(ii) For which lags are the t statistics less than two?

(iii) What is the estimated long-run propensity? Is it much different than one? Explain what the LRP
tells us in this example.

(iv) What regression would you run to obtain the standard error of the LRP directly?

(v) How would you test the joint significance of six more lags of gwage? What would be the dfs in
the F distribution? (Be careful here; you lose six more observations.)


6 Let $hy6_t$ denote the three-month holding yield (in percent) from buying a six-month T-bill at time
$(t - 1)$ and selling it at time $t$ (three months hence) as a three-month T-bill. Let $hy3_{t-1}$ be the
three-month holding yield from buying a three-month T-bill at time $(t - 1)$. At time $(t - 1)$, $hy3_{t-1}$ is
known, whereas $hy6_t$ is unknown because $p3_t$ (the price of three-month T-bills) is unknown at time
$(t - 1)$. The expectations hypothesis (EH) says that these two different three-month investments
should be the same, on average. Mathematically, we can write this as a conditional expectation:


$E(hy6_t \mid I_{t-1}) = hy3_{t-1},$


where $I_{t-1}$ denotes all observable information up through time $t - 1$. This suggests estimating the model


$hy6_t = \beta_0 + \beta_1 hy3_{t-1} + u_t,$


and testing $H_0: \beta_1 = 1$. (We can also test $H_0: \beta_0 = 0$, but we often allow for a term premium for buying
assets with different maturities, so that $\beta_0 \neq 0$.)

(i) Estimating the previous equation by OLS using the data in INTQRT (spaced every three
months) gives


$\widehat{hy6}_t = -.058 + 1.104\,hy3_{t-1}$
          (.070)   (.039)
$n = 123, \quad R^2 = .866.$


Do you reject $H_0: \beta_1 = 1$ against $H_1: \beta_1 \neq 1$ at the 1% significance level? Does the estimate
seem practically different from one?

(ii) Another implication of the EH is that no other variables dated as $t - 1$ or earlier should help
explain $hy6_t$, once $hy3_{t-1}$ has been controlled for. Including one lag of the spread between
six-month and three-month T-bill rates gives


$\widehat{hy6}_t = -.123 + 1.053\,hy3_{t-1} + .480\,(r6_{t-1} - r3_{t-1})$
          (.067)   (.039)             (.109)
$n = 123, \quad R^2 = .885.$


Now is the coefficient on $hy3_{t-1}$ statistically different from one? Is the lagged spread term
significant? According to this equation, if, at time $t - 1$, $r6$ is above $r3$, should you invest in
six-month or three-month T-bills?

(iii) The sample correlation between $hy3_t$ and $hy3_{t-1}$ is .914. Why might this raise some concerns
with the previous analysis?

(iv) How would you test for seasonality in the equation estimated in part (ii)?


7 A partial adjustment model is


$y_t^* = \gamma_0 + \gamma_1 x_t + e_t$
$y_t - y_{t-1} = \lambda(y_t^* - y_{t-1}) + a_t,$


where $y_t^*$ is the desired or optimal level of $y$ and $y_t$ is the actual (observed) level. For example, $y_t^*$ is
the desired growth in firm inventories, and $x_t$ is growth in firm sales. The parameter $\gamma_1$ measures the
effect of $x_t$ on $y_t^*$. The second equation describes how the actual $y$ adjusts depending on the relationship
between the desired $y$ in time $t$ and the actual $y$ in time $t - 1$. The parameter $\lambda$ measures the speed of
adjustment and satisfies $0 < \lambda < 1$.

(i) Plug the first equation for $y_t^*$ into the second equation and show that we can write


$y_t = \beta_0 + \beta_1 y_{t-1} + \beta_2 x_t + u_t.$


In particular, find the $\beta_j$ in terms of the $\gamma_j$ and $\lambda$ and find $u_t$ in terms of $e_t$ and $a_t$. Therefore,
the partial adjustment model leads to a model with a lagged dependent variable and a
contemporaneous $x$.

(ii) If $E(e_t \mid x_t, y_{t-1}, x_{t-1}, \ldots) = E(a_t \mid x_t, y_{t-1}, x_{t-1}, \ldots) = 0$ and all series are weakly dependent, how
would you estimate the $\beta_j$?

(iii) If $\hat{\beta}_1 = .7$ and $\hat{\beta}_2 = .2$, what are the estimates of $\gamma_1$ and $\lambda$?


8 Suppose that the equation


$y_t = \alpha_0 + \delta t + \beta_1 x_{t1} + \cdots + \beta_k x_{tk} + u_t$


satisfies the sequential exogeneity assumption in equation (11.40).

(i) Suppose you difference the equation to obtain


$\Delta y_t = \delta + \beta_1 \Delta x_{t1} + \cdots + \beta_k \Delta x_{tk} + \Delta u_t.$


Why does applying OLS on the differenced equation not generally result in consistent
estimators of the $\beta_j$?

(ii) What assumption on the explanatory variables in the original equation would ensure that OLS
on the differences consistently estimates the $\beta_j$?

(iii) Let $z_{t1}, \ldots, z_{tk}$ be a set of explanatory variables dated contemporaneously with $y_t$. If we specify
the static regression model $y_t = \beta_0 + \beta_1 z_{t1} + \cdots + \beta_k z_{tk} + u_t$, describe what we need to assume
for $\mathbf{x}_t = \mathbf{z}_t$ to be sequentially exogenous. Do you think the assumptions are likely to hold in
economic applications?


Computer Exercises 


C1 Use the data in HSEINV for this exercise.

(i) Find the first order autocorrelation in log(invpc). Now, find the autocorrelation after linearly
detrending log(invpc). Do the same for log(price). Which of the two series may have a unit root?

(ii) Based on your findings in part (i), estimate the equation


$\log(invpc_t) = \beta_0 + \beta_1 \Delta\log(price_t) + \beta_2 t + u_t$


and report the results in standard form. Interpret the coefficient $\hat{\beta}_1$ and determine whether it is
statistically significant.

(iii) Linearly detrend $\log(invpc_t)$ and use the detrended version as the dependent variable in the
regression from part (ii) (see Section 10-5). What happens to $R^2$?

(iv) Now use $\Delta\log(invpc_t)$ as the dependent variable. How do your results change from part (ii)? Is
the time trend still significant? Why or why not?


C2 In Example 11.7, define the growth in hourly wage and output per hour as the change in the natural
log: $ghrwage = \Delta\log(hrwage)$ and $goutphr = \Delta\log(outphr)$. Consider a simple extension of the
model estimated in (11.29):


$ghrwage_t = \beta_0 + \beta_1 goutphr_t + \beta_2 goutphr_{t-1} + u_t.$


This allows an increase in productivity growth to have both a current and lagged effect on wage growth.

(i) Estimate the equation using the data in EARNS and report the results in standard form. Is the
lagged value of goutphr statistically significant?

(ii) If $\beta_1 + \beta_2 = 1$, a permanent increase in productivity growth is fully passed on in higher wage
growth after one year. Test $H_0: \beta_1 + \beta_2 = 1$ against the two-sided alternative. Remember, one
way to do this is to write the equation so that $\theta = \beta_1 + \beta_2$ appears directly in the model, as in
Example 10.4 from Chapter 10.

(iii) Does $goutphr_{t-2}$ need to be in the model? Explain.


C3 (i) In Example 11.4, it may be that the expected value of the return at time $t$, given past returns, is a
quadratic function of $return_{t-1}$. To check this possibility, use the data in NYSE to estimate


$return_t = \beta_0 + \beta_1 return_{t-1} + \beta_2 return_{t-1}^2 + u_t;$


report the results in standard form.

(ii) State and test the null hypothesis that $E(return_t \mid return_{t-1})$ does not depend on $return_{t-1}$.
(Hint: There are two restrictions to test here.) What do you conclude?

(iii) Drop $return_{t-1}^2$ from the model, but add the interaction term $return_{t-1} \cdot return_{t-2}$. Now test the
efficient markets hypothesis.

(iv) What do you conclude about predicting weekly stock returns based on past stock returns?


C4 Use the data in PHILLIPS for this exercise, but only through 2006.

(i) In Example 11.5, we assumed that the natural rate of unemployment is constant. An alternative
form of the expectations augmented Phillips curve allows the natural rate of unemployment to
depend on past levels of unemployment. In the simplest case, the natural rate at time $t$ equals
$unem_{t-1}$. If we assume adaptive expectations, we obtain a Phillips curve where inflation and
unemployment are in first differences:


$\Delta inf = \beta_0 + \beta_1 \Delta unem + u.$


Estimate this model, report the results in the usual form, and discuss the sign, size, and
statistical significance of $\hat{\beta}_1$.

(ii) Which model fits the data better, (11.19) or the model from part (i)? Explain.


C5 (i) Add a linear time trend to equation (11.27). Is a time trend necessary in the first-difference
equation?

(ii) Drop the time trend and add the variables ww2 and pill to (11.27) (do not difference these
dummy variables). Are these variables jointly significant at the 5% level?

(iii) Add the linear time trend, ww2, and pill all to equation (11.27). What happens to the magnitude
and statistical significance of the time trend as compared with that in part (i)? What about the
coefficient on pill as compared with that in part (ii)?

(iv) Using the model from part (iii), estimate the LRP and obtain its standard error. Compare this
to (10.19), where gfr and pe appeared in levels rather than in first differences. Would you say
that the link between fertility and the value of the personal exemption is a particularly robust
finding?


C6 Let $inven_t$ be the real value of inventories in the United States during year $t$, let $GDP_t$ denote real gross
domestic product, and let $r3_t$ denote the (ex post) real interest rate on three-month T-bills. The ex post
real interest rate is (approximately) $r3_t = i3_t - inf_t$, where $i3_t$ is the rate on three-month T-bills and $inf_t$
is the annual inflation rate [see Mankiw (1994, Section 6-4)]. The change in inventories, $cinven_t$, is the
inventory investment for the year. The accelerator model of inventory investment relates cinven to
cGDP, the change in GDP:


$cinven_t = \beta_0 + \beta_1 cGDP_t + u_t,$


where $\beta_1 > 0$. [See, for example, Mankiw (1994), Chapter 17.]

(i) Use the data in INVEN to estimate the accelerator model. Report the results in the usual form
and interpret the equation. Is $\hat{\beta}_1$ statistically greater than zero?

(ii) If the real interest rate rises, then the opportunity cost of holding inventories rises, and so an
increase in the real interest rate should decrease inventories. Add the real interest rate to the
accelerator model and discuss the results.

(iii) Does the level of the real interest rate work better than the first difference, $cr3_t$?


C7 Use CONSUMP for this exercise. One version of the permanent income hypothesis (PIH) of
consumption is that the growth in consumption is unpredictable. [Another version is that the change in
consumption itself is unpredictable; see Mankiw (1994, Chapter 15) for discussion of the PIH.] Let
$gc_t = \log(c_t) - \log(c_{t-1})$ be the growth in real per capita consumption (of nondurables and services).
Then the PIH implies that $E(gc_t \mid I_{t-1}) = E(gc_t)$, where $I_{t-1}$ denotes information known at time $(t - 1)$;
in this case, $t$ denotes a year.

(i) Test the PIH by estimating $gc_t = \beta_0 + \beta_1 gc_{t-1} + u_t$. Clearly state the null and alternative
hypotheses. What do you conclude?

(ii) To the regression in part (i) add the variables $gy_{t-1}$, $i3_{t-1}$, and $inf_{t-1}$. Are these new variables
individually or jointly significant at the 5% level? (Be sure to report the appropriate p-values.)

(iii) In the regression from part (ii), what happens to the p-value for the t statistic on $gc_{t-1}$? Does this
mean the PIH hypothesis is now supported by the data?

(iv) In the regression from part (ii), what is the F statistic and its associated p-value for joint
significance of the four explanatory variables? Does your conclusion about the PIH now agree
with what you found in part (i)?


C8 Use the data in PHILLIPS for this exercise.

(i) Estimate an AR(1) model for the unemployment rate using only the data through the year 2003.
Use this equation to predict the unemployment rate for 2004. Compare this with the actual
unemployment rate for 2004.

(ii) Add a lag of inflation to the AR(1) model from part (i), again being sure to use only the data
through 2003. Is $inf_{t-1}$ statistically significant?

(iii) Use the equation from part (ii) to predict the unemployment rate for 2004. Is the result better or
worse than in the model from part (i)?

(iv) Use the method from Section 6-4 to construct a 95% prediction interval for the 2004
unemployment rate. Is the 2004 unemployment rate in the interval?


C9 Use the data in TRAFFIC2 for this exercise. Computer Exercise C11 in Chapter 10 previously asked
for an analysis of these data.

(i) Compute the first order autocorrelation coefficient for the variable prcfat. Are you concerned
that prcfat contains a unit root? Do the same for the unemployment rate.

(ii) Estimate a multiple regression model relating the first difference of prcfat, $\Delta prcfat$, to the
same variables in part (vi) of Computer Exercise C11 in Chapter 10, except you should first
difference the unemployment rate, too. Then, include a linear time trend, monthly dummy
variables, the weekend variable, and the two policy variables; do not difference these. Do you
find any interesting results?

(iii) Comment on the following statement: “We should always first difference any time series we
suspect of having a unit root before doing multiple regression because it is the safe strategy
and should give results similar to using the levels.” [In answering this, you may want to do the
regression from part (vi) of Computer Exercise C11 in Chapter 10, if you have not already.]


C10 Use all the data in PHILLIPS to answer this question. You should now use 56 years of data.

(i) Reestimate equation (11.19) and report the results in the usual form. Do the intercept and slope
estimates change notably when you add the recent years of data?

(ii) Obtain a new estimate of the natural rate of unemployment. Compare this new estimate with
that reported in Example 11.5.

(iii) Compute the first order autocorrelation for unem. In your opinion, is the root close to one?

(iv) Use cunem as the explanatory variable instead of unem. Which explanatory variable gives a
higher R-squared?


C11 Okun’s Law (see, for example, Mankiw (1994, Chapter 2)) implies the following relationship
between the annual percentage change in real GDP, pcrgdp, and the change in the annual
unemployment rate, cunem:


$pcrgdp = 3 - 2 \cdot cunem.$


If the unemployment rate is stable, real GDP grows at 3% annually. For each percentage point
increase in the unemployment rate, real GDP grows by two percentage points less. (This should not
be interpreted in any causal sense; it is more like a statistical description.)

To see if the data on the U.S. economy support Okun’s Law, we specify a model that allows
deviations via an error term, $pcrgdp_t = \beta_0 + \beta_1 cunem_t + u_t$.

(i) Use the data in OKUN to estimate the equation. Do you get exactly 3 for the intercept and −2
for the slope? Did you expect to?

(ii) Find the t statistic for testing $H_0: \beta_1 = -2$. Do you reject $H_0$ against the two-sided alternative at
any reasonable significance level?

(iii) Find the t statistic for testing $H_0: \beta_0 = 3$. Do you reject $H_0$ at the 5% level against the two-sided
alternative? Is it a “strong” rejection?

(iv) Find the F statistic and p-value for testing $H_0: \beta_0 = 3, \beta_1 = -2$ against the alternative that $H_0$
is false. Does the test reject at the 10% level? Overall, would you say the data reject or tend to
support Okun’s Law?


C12 Use the data in MINWAGE for this exercise, focusing on the wage and employment series for sector
232 (Men’s and Boys’ Furnishings). The variable gwage232 is the monthly growth (change in logs) in
the average wage in sector 232; gemp232 is the growth in employment in sector 232; gmwage is the
growth in the federal minimum wage; and gcpi is the growth in the (urban) Consumer Price Index.

(i) Find the first order autocorrelation in gwage232. Does this series appear to be weakly
dependent?

(ii) Estimate the dynamic model


$gwage232_t = \beta_0 + \beta_1 gwage232_{t-1} + \beta_2 gmwage_t + \beta_3 gcpi_t + u_t$


by OLS. Holding fixed last month’s growth in wage and the growth in the CPI, does an increase
in the federal minimum wage result in a contemporaneous increase in $gwage232_t$? Explain.

(iii) Now add the lagged growth in employment, $gemp232_{t-1}$, to the equation in part (ii). Is it
statistically significant?

(iv) Compared with the model without $gwage232_{t-1}$ and $gemp232_{t-1}$, does adding the two lagged
variables have much of an effect on the gmwage coefficient?

(v) Run the regression of $gmwage_t$ on $gwage232_{t-1}$ and $gemp232_{t-1}$, and report the R-squared.
Comment on how the value of R-squared helps explain your answer to part (iv).


C13 Use the data in BEVERIDGE to answer this question. The data set includes monthly observations on
vacancy rates and unemployment rates for the United States from December 2000 through February
2012.

(i) Find the correlation between urate and urate_1. Would you say the correlation points more
toward a unit root process or a weakly dependent process?

(ii) Repeat part (i) but with the vacancy rate, vrate.

(iii) The Beveridge Curve relates the unemployment rate to the vacancy rate, with the simplest
relationship being linear:


$urate_t = \beta_0 + \beta_1 vrate_t + u_t,$


where $\beta_1 < 0$ is expected. Estimate $\beta_0$ and $\beta_1$ by OLS and report the results in the usual form.
Do you find a negative relationship?

(iv) Explain why you cannot trust the confidence interval for $\beta_1$ reported by the OLS output in part
(iii). [The tools needed to study regressions of this type are presented in Chapter 18.]

(v) If you difference urate and vrate before running the regression, how does the estimated slope
coefficient compare with part (iii)? Is it statistically different from zero? [This example shows
that differencing before running an OLS regression is not always a sensible strategy. But we
cannot say more until Chapter 18.]


C14 Use the data in APPROVAL to answer the following questions. See also Computer Exercise C14 in
Chapter 10.

(i) Compute the first order autocorrelations for the variables approve and lrgasprice. Do they seem
close enough to unity to worry about unit roots?

(ii) Consider the model


$approve_t = \beta_0 + \beta_1 lcpifood_t + \beta_2 lrgasprice_t + \beta_3 unemploy_t$
$\quad + \beta_4 sep11_t + \beta_5 iraqinvade_t + u_t,$


where the first two variables are in logarithmic form. Given what you found in part (i),
why might you hesitate to estimate this model by OLS?

(iii) Estimate the equation in part (ii) by differencing all variables (including the dummy variables).
How do you interpret your estimate of $\beta_2$? Is it statistically significant? (Report the p-value.)

(iv) Interpret your estimate of $\beta_4$ and discuss its statistical significance.

(v) Add lsp500 to the model in part (ii) and estimate the equation by first differencing. Discuss
what you find for the stock market variable.


Serial Correlation and
Heteroskedasticity in
Time Series Regressions

In this chapter, we discuss the critical problem of serial correlation in the error terms of a multiple
regression model. We saw in Chapter 11 that when, in an appropriate sense, the dynamics of a model have
been completely specified, the errors will not be serially correlated. Thus, testing for serial correlation
can be used to detect dynamic misspecification. Furthermore, static and finite distributed lag models often
have serially correlated errors even if there is no underlying misspecification of the model. Therefore, it is
important to know the consequences and remedies for serial correlation for these useful classes of models.

The structure of this chapter is similar to Chapter 8, at least in the early sections. In Section 12-1,
we present the properties of OLS when the errors contain serial correlation. In Section 12-2, we show
how to compute standard errors for the OLS estimators that allow for general forms of serial correlation.
This topic used to be considered more advanced and often was not treated in introductory
books. However, just as it has become common in cross-sectional applications to use OLS and
adjust the standard errors for general heteroskedasticity, it has become common to use OLS for time
series regressions and compute serial correlation-robust standard errors. In Section 12-3, we study the
problem of testing for serial correlation. We cover tests that apply to models with strictly exogenous
regressors and tests that are asymptotically valid with general regressors, including lagged dependent
variables. Section 12-4 explains how to correct for serial correlation when we have a specific model
in mind, provided we are willing to assume that the explanatory variables are strictly exogenous.
Section 12-5 shows how using differenced data often eliminates serial correlation in the errors.

In Chapter 8, we discussed testing and correcting for heteroskedasticity in cross-sectional
applications. In Section 12-6, we show how the methods used in the cross-sectional case can be extended
to the time series case. The mechanics are essentially the same, but there are a few subtleties associated
with the temporal correlation in time series observations that must be addressed. In addition, we
briefly touch on the consequences of dynamic forms of heteroskedasticity.


12-1 Properties of OLS with Serially Correlated Errors 


12-1a Unbiasedness and Consistency 


In Chapter 10, we proved unbiasedness of the OLS estimator under the first three Gauss-Markov
assumptions for time series regressions (TS.1 through TS.3). In particular, Theorem 10.1 assumed
nothing about serial correlation in the errors. It follows that, as long as the explanatory variables are
strictly exogenous, the $\hat{\beta}_j$ are unbiased, regardless of the degree of serial correlation in the errors. This
is analogous to the observation that heteroskedasticity in the errors does not cause bias in the $\hat{\beta}_j$.

In Chapter 11, we relaxed the strict exogeneity assumption to $E(u_t \mid \mathbf{x}_t) = 0$, or even zero correlation,
and showed that, when the data are weakly dependent, the $\hat{\beta}_j$ are still consistent (although not
necessarily unbiased). This result did not hinge on any assumption about serial correlation in the errors.
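
A small Monte Carlo experiment can illustrate the unbiasedness claim. It is a sketch under assumed values, not a result from the text: the parameter values, the sample size, and the number of replications are hypothetical. The regressor is held fixed and drawn independently of the errors, so it is strictly exogenous, while the errors follow a stationary AR(1) process.

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 100, 2000
beta0, beta1, rho = 1.0, 1.0, 0.5   # assumed parameter values

x = rng.normal(size=n)              # regressor held fixed across replications
x = x - x.mean()                    # demean so the slope formula is simple
sst_x = np.sum(x ** 2)

# Draw AR(1) errors u_t = rho*u_{t-1} + e_t for all replications at once,
# starting from the stationary distribution so Var(u_t) is constant over t.
e = rng.normal(size=(reps, n))
u = np.empty((reps, n))
u[:, 0] = e[:, 0] / np.sqrt(1.0 - rho ** 2)
for t in range(1, n):
    u[:, t] = rho * u[:, t - 1] + e[:, t]

y = beta0 + beta1 * x + u           # shape (reps, n); x broadcasts across rows
slopes = (y @ x) / sst_x            # OLS slope in each sample (x has mean zero)
mean_slope = slopes.mean()
print(f"average slope estimate over {reps} samples: {mean_slope:.3f}")
```

The average of the slope estimates should sit close to the assumed value of 1 even though the errors are serially correlated; what serial correlation changes is the sampling variance, not the center of the distribution.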


12-1b Efficiency and Inference 


Because the Gauss-Markov Theorem (Theorem 10.4) requires both homoskedasticity and serially 
uncorrelated errors, OLS is no longer BLUE in the presence of serial correlation. Even more impor- 
tantly, the usual OLS standard errors and test statistics are not valid, even asymptotically. We can see 
this by computing the variance of the OLS estimator under the first four Gauss-Markov assumptions 
and the AR(1) serial correlation model for the error terms. More precisely, we assume that 


$u_t = \rho u_{t-1} + e_t, \quad t = 1, 2, \ldots, n$ [12.1] 
$|\rho| < 1$, [12.2] 


where the $e_t$ are uncorrelated random variables with mean zero and variance $\sigma_e^2$; recall from 
Chapter 11 that assumption (12.2) is the stability condition. 
We consider the variance of the OLS slope estimator in the simple regression model 


$y_t = \beta_0 + \beta_1 x_t + u_t$ 


and, just to simplify the formula, we assume that the sample average of the $x_t$ is zero ($\bar{x} = 0$). Then, 
the OLS estimator $\hat{\beta}_1$ of $\beta_1$ can be written as 


$\hat{\beta}_1 = \beta_1 + \mathrm{SST}_x^{-1} \sum_{t=1}^{n} x_t u_t$, [12.3] 


where $\mathrm{SST}_x = \sum_{t=1}^{n} x_t^2$. Now, in computing the variance of $\hat{\beta}_1$ (conditional on $X$), we must account 
for the serial correlation in the $u_t$: 


$\mathrm{Var}(\hat{\beta}_1) = \mathrm{SST}_x^{-2} \, \mathrm{Var}\left( \sum_{t=1}^{n} x_t u_t \right)$ 
$= \mathrm{SST}_x^{-2} \left( \sum_{t=1}^{n} x_t^2 \mathrm{Var}(u_t) + 2 \sum_{t=1}^{n-1} \sum_{j=1}^{n-t} x_t x_{t+j} E(u_t u_{t+j}) \right)$ [12.4] 
$= \sigma^2/\mathrm{SST}_x + 2(\sigma^2/\mathrm{SST}_x^2) \sum_{t=1}^{n-1} \sum_{j=1}^{n-t} \rho^j x_t x_{t+j}$, 


where $\sigma^2 = \mathrm{Var}(u_t)$ and we have used the fact that $E(u_t u_{t+j}) = \mathrm{Cov}(u_t, u_{t+j}) = \rho^j \sigma^2$ [see equation 
(11.4)]. The first term in equation (12.4), $\sigma^2/\mathrm{SST}_x$, is the variance of $\hat{\beta}_1$ when $\rho = 0$, which is the 
familiar OLS variance under the Gauss-Markov assumptions. If we ignore the serial correlation and 
estimate the variance in the usual way, the variance estimator will usually be biased when $\rho \neq 0$ 
because it ignores the second term in (12.4). As we will see through later examples, $\rho > 0$ is most 
common, in which case $\rho^j > 0$ for all $j$. Further, the independent variables in regression models 
are often positively correlated over time, so that $x_t x_{t+j}$ is positive for most pairs $t$ and $t + j$. Therefore, in 
most economic applications, the term $\sum_{t=1}^{n-1} \sum_{j=1}^{n-t} \rho^j x_t x_{t+j}$ is positive, and so the usual OLS variance 
formula $\sigma^2/\mathrm{SST}_x$ understates the true variance of the OLS estimator. If $\rho$ is large or $x_t$ has a high 
degree of positive serial correlation (a common case), the bias in the usual OLS variance estimator 
can be substantial. We will tend to think the OLS slope estimator is more precise than it actually is. 
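
To get a feel for the size of the second term in (12.4), the following Python sketch evaluates both the usual formula and the full expression for a given demeaned regressor. The function name, the trending regressor, and the parameter values are illustrative assumptions and do not come from the text.

```python
def ols_slope_variances(x, rho, sigma2):
    """Return (usual, full) variances of the OLS slope, following (12.4).

    x      : regressor values, assumed already demeaned (sample average zero)
    rho    : AR(1) parameter of the errors, |rho| < 1
    sigma2 : Var(u_t)
    """
    n = len(x)
    sst = sum(v * v for v in x)
    usual = sigma2 / sst  # first term of (12.4)
    # Double sum over t = 1..n-1 and j = 1..n-t of rho^j * x_t * x_{t+j}
    cross = 0.0
    for t in range(n - 1):
        for j in range(1, n - t):
            cross += rho ** j * x[t] * x[t + j]
    full = usual + 2.0 * (sigma2 / sst ** 2) * cross
    return usual, full


# Hypothetical regressor: a demeaned linear trend, which is positively
# correlated over time.
n = 20
x = [t - (n + 1) / 2 for t in range(1, n + 1)]

for rho in (0.0, 0.5, -0.5):
    usual, full = ols_slope_variances(x, rho, sigma2=1.0)
    print(f"rho={rho:+.1f}  usual={usual:.6f}  full={full:.6f}")
```

With rho = 0 the two expressions coincide, and with rho = 0.5 and this trending regressor the full variance exceeds the usual one, which is the understatement described above.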

When $\rho < 0$, $\rho^j$ is negative when $j$ is odd and positive when $j$ is even, and so it is difficult to 
determine the sign of $\sum_{t=1}^{n-1} \sum_{j=1}^{n-t} \rho^j x_t x_{t+j}$. In fact, it is possible that the usual OLS variance 
formula actually overstates the true variance of $\hat{\beta}_1$. In either case, the usual variance estimator will be 
biased for $\mathrm{Var}(\hat{\beta}_1)$ in the presence of serial correlation. 

GOING FURTHER 12.1 
Suppose that, rather than the AR(1) model, $u_t$ follows the MA(1) model $u_t = e_t + \alpha e_{t-1}$. 
Find $\mathrm{Var}(\hat{\beta}_1)$ and show that it is different from the usual formula if $\alpha \neq 0$. 

Because the standard error of $\hat{\beta}_1$ is an estimate of the standard deviation of $\hat{\beta}_1$, using the usual 
OLS standard error in the presence of serial correlation is invalid. Therefore, t statistics are no longer 
valid for testing single hypotheses. Because a smaller standard error means a larger t statistic, the 
usual t statistics will often be too large when $\rho > 0$. The usual F and LM statistics for testing multiple 
hypotheses are also invalid. 


12-1c Goodness of Fit 


Sometimes one sees the claim that serial correlation in the errors of a time series regression model 
invalidates our usual goodness-of-fit measures, R-squared and adjusted R-squared. Fortunately, this 
is not the case, provided the data are stationary and weakly dependent. To see why these measures 
are still valid, recall that we defined the population R-squared in a cross-sectional context 
to be $1 - \sigma_u^2/\sigma_y^2$ (see Section 6-3). This definition is still appropriate in the context of time series 
regressions with stationary, weakly dependent data: the variances of both the error and the dependent 
variable do not change over time. By the law of large numbers, $R^2$ and $\bar{R}^2$ both consistently 
estimate the population R-squared. The argument is essentially the same as in the cross-sectional 
case in the presence of heteroskedasticity (see Section 8-1). Because there is never an unbiased 
estimator of the population R-squared, it makes no sense to talk about bias in $R^2$ caused by serial 
correlation. All we can really say is that our goodness-of-fit measures are still consistent estimators 
of the population parameter. This argument does not go through if $\{y_t\}$ is an I(1) process because 
$\mathrm{Var}(y_t)$ grows with $t$; goodness of fit does not make much sense in this case. As we discussed in 
Section 10-5, trends in the mean of $y_t$, or seasonality, can and should be accounted for in computing 
an R-squared. Other departures from stationarity do not cause difficulty in interpreting $R^2$ and $\bar{R}^2$ in 
the usual ways. 


12-1d Serial Correlation in the Presence of Lagged Dependent Variables 


Beginners in econometrics are often warned of the dangers of serially correlated errors in the presence 
of lagged dependent variables. Almost every textbook on econometrics contains some form of the 
statement “OLS is inconsistent in the presence of lagged dependent variables and serially correlated 
errors.” Unfortunately, as a general assertion, this statement is false. There is a version of the statement 
that is correct, but it is important to be very precise. 
To illustrate, suppose that the expected value of $y_t$ given $y_{t-1}$ is linear: 


$E(y_t \mid y_{t-1}) = \beta_0 + \beta_1 y_{t-1}$, [12.5] 


where we assume stability, $|\beta_1| < 1$. We know we can always write this with an error term as 


$y_t = \beta_0 + \beta_1 y_{t-1} + u_t$, [12.6] 
$E(u_t \mid y_{t-1}) = 0$. [12.7] 


By construction, this model satisfies the key zero conditional mean Assumption TS.3’ for consistency of 
OLS; therefore, the OLS estimators $\hat{\beta}_0$ and $\hat{\beta}_1$ are consistent. It is important to see that, without further 
assumptions, the errors $\{u_t\}$ can be serially correlated. Condition (12.7) ensures that $u_t$ is uncorrelated 
with $y_{t-1}$, but $u_t$ and $y_{t-2}$ could be correlated. Then, because $u_{t-1} = y_{t-1} - \beta_0 - \beta_1 y_{t-2}$, the covariance 
between $u_t$ and $u_{t-1}$ is $-\beta_1 \mathrm{Cov}(u_t, y_{t-2})$, which is not necessarily zero. Thus, the errors exhibit serial 
correlation and the model contains a lagged dependent variable, but OLS consistently estimates $\beta_0$ and $\beta_1$ 
because these are the parameters in the conditional expectation (12.5). The serial correlation in the errors 
will cause the usual OLS statistics to be invalid for testing purposes, but it will not affect consistency. 

So when is OLS inconsistent if the errors are serially correlated and the regressors contain a 
lagged dependent variable? This happens when we write the model in error form, exactly as in (12.6), 
but then we assume that $\{u_t\}$ follows a stable AR(1) model as in (12.1) and (12.2), where 


$E(e_t \mid u_{t-1}, u_{t-2}, \ldots) = E(e_t \mid y_{t-1}, y_{t-2}, \ldots) = 0$. [12.8] 


Because $e_t$ is uncorrelated with $y_{t-1}$ by assumption, $\mathrm{Cov}(y_{t-1}, u_t) = \rho \, \mathrm{Cov}(y_{t-1}, u_{t-1})$, which is not 
zero unless $\rho = 0$. This causes the OLS estimators of $\beta_0$ and $\beta_1$ from the regression of $y_t$ on $y_{t-1}$ to 
be inconsistent. 
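
A small simulation can illustrate this inconsistency. The sketch below generates data from (12.6) with AR(1) errors, using made-up parameter values, and compares the OLS slope with $\beta_1$. The closed-form limit $(\beta_1 + \rho)/(1 + \rho\beta_1)$ used for comparison is not stated in the text; it follows from the covariance calculation above under stationarity and should be read as an added result of this sketch.

```python
import random


def simulate_y(beta0, beta1, rho, n, seed=0, burn=200):
    """Generate y_t = beta0 + beta1*y_{t-1} + u_t with u_t = rho*u_{t-1} + e_t."""
    rng = random.Random(seed)
    y_prev, u_prev = beta0 / (1.0 - beta1), 0.0
    ys = []
    for t in range(n + burn):
        u = rho * u_prev + rng.gauss(0.0, 1.0)
        y = beta0 + beta1 * y_prev + u
        if t >= burn:          # drop the burn-in observations
            ys.append(y)
        y_prev, u_prev = y, u
    return ys


def ols_slope_on_lag(ys):
    """Slope from the OLS regression of y_t on y_{t-1} (with an intercept)."""
    lag, cur = ys[:-1], ys[1:]
    m_lag = sum(lag) / len(lag)
    m_cur = sum(cur) / len(cur)
    sxy = sum((a - m_lag) * (b - m_cur) for a, b in zip(lag, cur))
    sxx = sum((a - m_lag) ** 2 for a in lag)
    return sxy / sxx


beta0, beta1, rho = 1.0, 0.5, 0.5   # illustrative values only
ys = simulate_y(beta0, beta1, rho, n=50_000)
slope = ols_slope_on_lag(ys)
plim = (beta1 + rho) / (1.0 + rho * beta1)
print(f"OLS slope = {slope:.3f}, beta1 = {beta1}, implied limit = {plim:.3f}")
```

Even with a very large sample, the estimated slope settles near the implied limit rather than near $\beta_1$.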

We now see that OLS estimation of (12.6) when the errors $u_t$ also follow an AR(1) model leads 
to inconsistent estimators. However, the correctness of this statement makes it no less wrongheaded. 
We have to ask: What would be the point in estimating the parameters in (12.6) when the errors follow 
an AR(1) model? It is difficult to think of cases where this would be interesting. At least in (12.5) 
the parameters tell us the expected value of $y_t$ given $y_{t-1}$. When we combine (12.6) and (12.1), we 
see that $y_t$ really follows a second-order autoregressive model, or AR(2) model. To see this, write 
$u_{t-1} = y_{t-1} - \beta_0 - \beta_1 y_{t-2}$ and plug this into $u_t = \rho u_{t-1} + e_t$. Then, (12.6) can be rewritten as 


$y_t = \beta_0 + \beta_1 y_{t-1} + \rho (y_{t-1} - \beta_0 - \beta_1 y_{t-2}) + e_t$ 
$= \beta_0 (1 - \rho) + (\beta_1 + \rho) y_{t-1} - \rho \beta_1 y_{t-2} + e_t$ 
$= \alpha_0 + \alpha_1 y_{t-1} + \alpha_2 y_{t-2} + e_t$, 


where $\alpha_0 = \beta_0 (1 - \rho)$, $\alpha_1 = \beta_1 + \rho$, and $\alpha_2 = -\rho \beta_1$. Given (12.8), it follows that 

$E(y_t \mid y_{t-1}, y_{t-2}, \ldots) = E(y_t \mid y_{t-1}, y_{t-2}) = \alpha_0 + \alpha_1 y_{t-1} + \alpha_2 y_{t-2}$. [12.9] 


This means that the expected value of $y_t$, given all past $y$, depends on two lags of $y$. It is equation (12.9) 
that we would be interested in using for any practical purpose, including forecasting, as we will see in 
Chapter 18. We are especially interested in the parameters $\alpha_j$. Under the appropriate stability conditions 
for an AR(2) model (which we will cover in Section 12-4), OLS estimation of (12.9) produces 
consistent and asymptotically normal estimators of the $\alpha_j$. 
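
The algebra behind the AR(2) representation can be checked numerically. In the sketch below, the parameter values, seed, and function name are illustrative assumptions; the code builds $y_t$ from (12.6) with AR(1) errors and confirms that the same series satisfies the AR(2) equation with the coefficients defined above.

```python
import random


def ar2_coefficients(beta0, beta1, rho):
    """Map (beta0, beta1, rho) into (alpha0, alpha1, alpha2) as in the text."""
    return beta0 * (1.0 - rho), beta1 + rho, -rho * beta1


beta0, beta1, rho = 2.0, 0.4, 0.6   # illustrative values only
alpha0, alpha1, alpha2 = ar2_coefficients(beta0, beta1, rho)

rng = random.Random(1)
n = 50
e = [rng.gauss(0.0, 1.0) for _ in range(n)]

# Build u_t and y_t from (12.6) combined with (12.1), starting from zeros.
u = [0.0] * n
y = [0.0] * n
for t in range(1, n):
    u[t] = rho * u[t - 1] + e[t]
    y[t] = beta0 + beta1 * y[t - 1] + u[t]

# For t >= 2 the same y_t must satisfy the AR(2) representation exactly
# (up to floating-point rounding).
max_gap = max(
    abs(y[t] - (alpha0 + alpha1 * y[t - 1] + alpha2 * y[t - 2] + e[t]))
    for t in range(2, n)
)
print(alpha0, alpha1, alpha2, max_gap)
```

The largest discrepancy is at the level of rounding error, so the two ways of writing the model describe the same process.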

The bottom line is that you need a good reason for having both a lagged dependent variable in a 
model and a particular model of serial correlation in the errors. Often, serial correlation in the errors 
of a dynamic model simply indicates that the dynamic regression function has not been completely 
specified: in the previous example, we should add $y_{t-2}$ to the equation. 

In Chapter 18, we will see examples of models with lagged dependent variables where the errors 
are serially correlated and are also correlated with $y_{t-1}$. But even in these cases the errors do not 
follow an autoregressive process. 


12-2 Serial Correlation—Robust Inference after OLS 


Ordinary least squares is attractive for time series analysis because it does not require more than 
contemporaneous exogeneity for consistency. For this reason, it has become common in recent years 
to use OLS and then obtain standard errors that are valid in the presence of fairly arbitrary forms of 
serial correlation (as well as heteroskedasticity). Such standard errors are now computed routinely in 
many econometrics packages. 

To illustrate how one obtains serial correlation-robust standard errors, consider equation (12.4), 
which is the variance of the OLS slope estimator in a simple regression model with AR(1) errors. We 
can estimate this variance very simply by plugging in our standard estimators of $\rho$ and $\sigma^2$. The only 
problems with this are that it assumes the AR(1) model holds and also assumes homoskedasticity. It is 
possible to relax both of these assumptions. 

A general treatment of standard errors that are both heteroskedasticity- and serial correlation— 
robust is given in Davidson and MacKinnon (1993), among other places. Here, we provide a simple 
method to compute the robust standard error of any OLS coefficient. 

Our treatment here follows Wooldridge (1989). Consider the standard multiple linear regression 
model 


$y_t = \beta_0 + \beta_1 x_{t1} + \cdots + \beta_k x_{tk} + u_t, \quad t = 1, 2, \ldots, n$, [12.10] 


which we have estimated by OLS. For concreteness, we are interested in obtaining a serial correlation-robust 
standard error for $\hat{\beta}_1$. This turns out to be fairly easy. Write $x_{t1}$ as a linear function of the 
remaining independent variables and an error term, 


$x_{t1} = \delta_0 + \delta_2 x_{t2} + \cdots + \delta_k x_{tk} + r_t$, 


where the error $r_t$ has zero mean and is uncorrelated with $x_{t2}, x_{t3}, \ldots, x_{tk}$. 
Then, it can be shown that the asymptotic variance of the OLS estimator $\hat{\beta}_1$ is 


$\mathrm{AVar}(\hat{\beta}_1) = \left( \sum_{t=1}^{n} E(r_t^2) \right)^{-2} \mathrm{Var}\left( \sum_{t=1}^{n} r_t u_t \right)$. 


Under the no serial correlation Assumption TS.5’, $\{a_t \equiv r_t u_t\}$ is serially uncorrelated, so either the 
usual OLS standard errors (under homoskedasticity) or the heteroskedasticity-robust standard errors 
will be valid. But if TS.5’ fails, our expression for $\mathrm{AVar}(\hat{\beta}_1)$ must account for the correlation between 
$a_t$ and $a_s$ when $t \neq s$. In practice, it is common to assume that, once the terms are farther apart than a 
few periods, the correlation is essentially zero. Remember that under weak dependence, the correlation 
must be approaching zero, so this is a reasonable approach. 

Following the general framework of Newey and West (1987), Wooldridge (1989) shows that 
$\mathrm{AVar}(\hat{\beta}_1)$ can be estimated as follows. Let “se($\hat{\beta}_1$)” denote the usual (but incorrect) OLS standard 
error and let $\hat{\sigma}$ be the usual standard error of the regression (or root mean squared error) from estimating 
(12.10) by OLS. Let $\hat{r}_t$ denote the residuals from the auxiliary regression of 


$x_{t1}$ on $x_{t2}, x_{t3}, \ldots, x_{tk}$ [12.11] 

(including a constant, as usual). For a chosen integer $g > 0$, define 


$\hat{v} = \sum_{t=1}^{n} \hat{a}_t^2 + 2 \sum_{h=1}^{g} [1 - h/(g + 1)] \left( \sum_{t=h+1}^{n} \hat{a}_t \hat{a}_{t-h} \right)$, [12.12] 


where 


$\hat{a}_t = \hat{r}_t \hat{u}_t, \quad t = 1, 2, \ldots, n$. 


This looks somewhat complicated, but in practice it is easy to obtain. The integer $g$ in (12.12), often 
called the truncation lag, controls how much serial correlation we are allowing in computing the 
standard error. Once we have $\hat{v}$, the serial correlation-robust standard error of $\hat{\beta}_1$ is simply 


$\mathrm{se}(\hat{\beta}_1) = [\text{“se}(\hat{\beta}_1)\text{”}/\hat{\sigma}]^2 \sqrt{\hat{v}}$. [12.13] 


In other words, we take the usual OLS standard error of $\hat{\beta}_1$, divide it by $\hat{\sigma}$, square the result, and then 
multiply by the square root of $\hat{v}$. This can be used to construct confidence intervals and t statistics for 
$\hat{\beta}_1$. 
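It is useful to see what $\hat{v}$ looks like in some simple cases. When $g = 1$, 


$\hat{v} = \sum_{t=1}^{n} \hat{a}_t^2 + \sum_{t=2}^{n} \hat{a}_t \hat{a}_{t-1}$, [12.14] 

and when $g = 2$, 


$\hat{v} = \sum_{t=1}^{n} \hat{a}_t^2 + (4/3) \left( \sum_{t=2}^{n} \hat{a}_t \hat{a}_{t-1} \right) + (2/3) \left( \sum_{t=3}^{n} \hat{a}_t \hat{a}_{t-2} \right)$. [12.15] 


The larger $g$ is, the more terms are included to correct for serial correlation. The purpose of the factor 
$[1 - h/(g + 1)]$ in (12.12) is to ensure that $\hat{v}$ is in fact nonnegative [Newey and West (1987) verify 
this]. We clearly need $\hat{v} \geq 0$, as $\hat{v}$ is estimating a variance and the square root of $\hat{v}$ appears in (12.13). 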
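
As a hedged illustration of how the weights in (12.12) work, the following Python sketch computes $\hat{v}$ for an arbitrary truncation lag. The function name and the toy values of $\hat{a}_t$ are assumptions made for the example.

```python
def nw_vhat(a_hat, g):
    """Compute v-hat in (12.12) from the products a_hat[t] = r_hat[t] * u_hat[t]."""
    n = len(a_hat)
    v = sum(a * a for a in a_hat)
    for h in range(1, g + 1):
        weight = 1.0 - h / (g + 1.0)
        # Sum over t = h+1..n of a_t * a_{t-h} (0-based: t = h..n-1)
        cov_h = sum(a_hat[t] * a_hat[t - h] for t in range(h, n))
        v += 2.0 * weight * cov_h
    return v


# Toy values, for illustration only
a_hat = [0.5, -0.2, 0.3, 0.1, -0.4]
for g in (0, 1, 2):
    print(g, nw_vhat(a_hat, g))
```

Setting g = 1 or g = 2 reproduces the weights 1, and 4/3 and 2/3, that appear in (12.14) and (12.15); setting g = 0 drops the second term entirely.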

The standard error in (12.13) is also robust to arbitrary heteroskedasticity. (In the time series 
literature, the serial correlation-robust standard errors are sometimes called heteroskedasticity and 
autocorrelation consistent (HAC) standard errors.) In fact, if we drop the second term in (12.12), 
then (12.13) becomes the usual heteroskedasticity-robust standard error that we discussed in Chapter 8 
(without the degrees of freedom adjustment). 

The theory underlying the standard error in (12.13) is technical and somewhat subtle. 
Remember, we started off by claiming we do not know the form of serial correlation. If this is the 
case, how can we select the truncation lag $g$, which must be an integer? Theory states that (12.13) 
works for fairly arbitrary forms of serial correlation, provided $g$ grows with sample size $n$. The idea 
is that, with larger sample sizes, we can be more flexible about the amount of correlation in (12.12). 
There has been much work on the relationship between $g$ and $n$, but we will not go into details 
here. For annual data, choosing a small $g$, such as $g = 1$ or $g = 2$, is likely to account for most of 
the serial correlation. For quarterly or monthly data, $g$ should probably be larger (such as $g = 4$ 
or 8 for quarterly and $g = 12$ or 24 for monthly), assuming that we have enough data. In computing 
the Newey-West standard errors, the econometrics program Eviews® uses the integer part of 
$4(n/100)^{2/9}$, which appears in Newey and West (1994) but only in a preliminary stage, not in the final 
choice of $g$. Newey and West (1994) actually recommend a multiple of $n^{1/3}$, where the multiple is 
obtained from the data in an initial stage. Stock and Watson (2014, Chapter 15) recommend taking $g$ 
to be the integer part of $(3/4)n^{1/3}$, which they obtain using results by Andrews (1991) under the 
nominal assumption that the errors follow an AR(1) process with $\rho = .5$. Others have suggested the 
integer part of $n^{1/4}$. For, say, $n = 70$, which is reasonable for annual, post-World War II data, 
$4(70/100)^{2/9} \approx 3.695$, $(3/4)(70)^{1/3} \approx 3.091$, and $(70)^{1/4} \approx 2.893$. The integer parts are 3, 3, and 2, 
respectively. You are invited to see what happens as $n$ increases to, say, $n = 280$, which would be 
70 years of quarterly data. 
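
The three rules of thumb are easy to tabulate for any sample size. The helper below is a minimal sketch (its name and the dictionary labels are our own) that returns the integer parts for a given $n$.

```python
def truncation_lags(n):
    """Integer parts of three rules of thumb for the truncation lag g."""
    return {
        "4(n/100)^(2/9)": int(4.0 * (n / 100.0) ** (2.0 / 9.0)),
        "(3/4)n^(1/3)": int(0.75 * n ** (1.0 / 3.0)),
        "n^(1/4)": int(n ** 0.25),
    }


for n in (70, 280):
    print(n, truncation_lags(n))
```

For n = 70 this reproduces the integer parts 3, 3, and 2 given above; the n = 280 row answers the invitation at the end of the paragraph.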

We summarize how to obtain a serial correlation-robust standard error for $\hat{\beta}_1$. Of course, because 
we can list any independent variable first, the following procedure works for computing a standard 
error for any slope coefficient. Although not common, it is even possible to use a different truncation 
lag for different coefficients. 

Serial Correlation-Robust Standard Error for $\hat{\beta}_1$: 

(i) Estimate (12.10) by OLS, which yields “se($\hat{\beta}_1$),” $\hat{\sigma}$, and the OLS residuals $\{\hat{u}_t: t = 1, \ldots, n\}$. 

(ii) Compute the residuals $\{\hat{r}_t: t = 1, \ldots, n\}$ from the auxiliary regression (12.11). Then, form 
$\hat{a}_t = \hat{r}_t \hat{u}_t$ (for each $t$). 

(iii) For your choice of $g$, compute $\hat{v}$ as in (12.12). 

(iv) Compute $\mathrm{se}(\hat{\beta}_1)$ from (12.13). 
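
The four steps can be sketched in plain Python. Everything below is illustrative: the small normal-equations solver, the function names, and the simulated data are assumptions of this sketch, not the text's data or any package's implementation. The usual standard error is computed from the identity “se($\hat{\beta}_1$)” $= \hat{\sigma}/(\sum_t \hat{r}_t^2)^{1/2}$.

```python
import math
import random


def ols(y, X):
    """OLS by solving the normal equations; returns (coefficients, residuals).

    X is a list of rows; include a column of ones for the intercept.
    """
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [
        [sum(row[i] * row[j] for row in X) for j in range(k)]
        + [sum(row[i] * yi for row, yi in zip(X, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(k):
        p = max(range(c, k), key=lambda i: abs(A[i][c]))
        A[c], A[p] = A[p], A[c]
        for i in range(k):
            if i != c:
                f = A[i][c] / A[c][c]
                A[i] = [a - f * b for a, b in zip(A[i], A[c])]
    beta = [A[i][k] / A[i][i] for i in range(k)]
    resid = [yi - sum(b * v for b, v in zip(beta, row)) for row, yi in zip(X, y)]
    return beta, resid


def sc_robust_se(y, X, g, j=1):
    """Return (usual_se, robust_se) for the coefficient on column j of X.

    Column 0 of X is assumed to be the constant.
    """
    n, ncol = len(y), len(X[0])
    # Step (i): OLS of y on all regressors
    _, u_hat = ols(y, X)
    sigma_hat = math.sqrt(sum(u * u for u in u_hat) / (n - ncol))
    # Step (ii): residuals from regressing column j on the other columns
    others = [[v for c, v in enumerate(row) if c != j] for row in X]
    _, r_hat = ols([row[j] for row in X], others)
    usual_se = sigma_hat / math.sqrt(sum(r * r for r in r_hat))
    a_hat = [r * u for r, u in zip(r_hat, u_hat)]
    # Step (iii): v-hat from (12.12)
    v_hat = sum(a * a for a in a_hat)
    for h in range(1, g + 1):
        w = 1.0 - h / (g + 1.0)
        v_hat += 2.0 * w * sum(a_hat[t] * a_hat[t - h] for t in range(h, n))
    # Step (iv): equation (12.13)
    robust_se = (usual_se / sigma_hat) ** 2 * math.sqrt(v_hat)
    return usual_se, robust_se


# Simulated data (illustrative): persistent regressor and AR(1) errors
rng = random.Random(42)
n = 80
x1, x2, u = [0.0], [0.0], [0.0]
for _ in range(1, n):
    x1.append(0.7 * x1[-1] + rng.gauss(0.0, 1.0))
    x2.append(rng.gauss(0.0, 1.0))
    u.append(0.6 * u[-1] + rng.gauss(0.0, 1.0))
y = [1.0 + 0.5 * a - 0.3 * b + e for a, b, e in zip(x1, x2, u)]
X = [[1.0, a, b] for a, b in zip(x1, x2)]

usual, robust = sc_robust_se(y, X, g=2)
print(f"usual se = {usual:.4f}, SC-robust se (g = 2) = {robust:.4f}")
```

Passing a different column index j gives the robust standard error for another slope coefficient, in line with the remark that any regressor can be listed first.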

The previous procedure can be used with any software that supports basic OLS regression. Many 
professional econometrics packages have built-in commands for so-called HAC standard errors, and 
so carrying out the separate steps is not necessary. Describing how those commands work requires 
matrix algebra, but the idea is very similar to the procedure just provided. 

Empirically, the serial correlation-robust standard errors are typically larger than the usual OLS 
standard errors when there is serial correlation. This is true because, in most cases, the errors are positively 
serially correlated. However, it is possible to have substantial serial correlation in $\{u_t\}$ but to also 
have similarities in the usual and serial correlation-robust (SC-robust) standard errors of some coefficients: 
it is the sample autocorrelations of $\hat{a}_t = \hat{r}_t \hat{u}_t$ that determine the robust standard error for $\hat{\beta}_1$. 

After their initial introduction, the use of SC-robust (HAC) standard errors somewhat lagged 
behind the use of standard errors robust only to heteroskedasticity. One reason is that large cross sec- 
tions, where the heteroskedasticity-robust standard errors generally have good properties, are more 
common than large time series. The Newey-West standard errors can be poorly behaved when there 
is substantial serial correlation and the sample size is small (where small can even be as large as, say, 
100). A second reason that one might hesitate to compute a Newey-West standard error is that the 
bandwidth g in equation (12.12) needs to be chosen. Therefore, the computation of a Newey-West 
standard error is not automatic unless you allow the econometrics packages to use a rule of thumb. 
Even if you do so, you still have to abide by the choice. Unfortunately, the standard errors can be sen- 
sitive to the choice of g. See Kiefer and Vogelsang (2005) for discussion. 

Nevertheless, Newey-West standard errors, and variants, are now in widespread use for static 
and distributed lag models estimated by OLS. As we have discussed at several points, consistency 
of OLS does not require strict exogeneity of the explanatory variables, which makes it attractive 
when future outcomes on the explanatory variables may react to changes in the current error, $u_t$. If the 
explanatory variables are strictly exogenous, then we may be able to improve over OLS by general- 
ized least squares procedures, which we discuss in Section 12-4. One may be forced to at least try a 
GLS approach if the Newey-West standard errors seem too large to allow us to learn about true effects. 


EXAMPLE 12.1 The Puerto Rican Minimum Wage 


In Chapter 10 (see Example 10.9), we estimated the effect of the minimum wage on the Puerto Rican employ- 
ment rate. We now compute a Newey-West standard error for the OLS coefficient on the minimum wage 
variable. We add log(prgnp) as an additional control variable, as in Computer Exercise C3 in Chapter 10. 
Therefore, the explanatory variables are log(mincov), log(usgnp), log(prgnp), and a linear time trend. 

The OLS estimate of the elasticity of the employment rate with respect to the minimum wage 
is $\hat{\beta}_1 = -.2123$ and the usual OLS standard error is “se($\hat{\beta}_1$)” = .0402. The standard error of the 
regression is $\hat{\sigma} = .0328$. Further, using the previous procedure with $g = 2$ [see (12.15)], we obtain 
$\hat{v} = .000805$. This gives the SC-robust standard error as $\mathrm{se}(\hat{\beta}_1) = (.0402/.0328)^2 \sqrt{.000805} \approx .0426$. 
As we might expect, the SC-robust (HAC) standard error is greater than the OLS standard error, but it 
is only about 6% larger. The HAC t statistic is about −4.98, and so the estimated elasticity is still very 
statistically significant. (The robust confidence interval is slightly wider, of course.) We can now have 
more confidence in our inference because we have accounted for serial correlation, to some extent, in 
computing the standard error for OLS. 
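
The arithmetic in this example is a direct application of (12.13) and can be reproduced in a few lines. The variable names below are our own; the numbers are those reported in the example.

```python
import math

usual_se = 0.0402      # "se(beta1_hat)" reported in the example
sigma_hat = 0.0328     # standard error of the regression
v_hat = 0.000805       # from (12.12) with g = 2
beta1_hat = -0.2123    # OLS elasticity estimate

robust_se = (usual_se / sigma_hat) ** 2 * math.sqrt(v_hat)   # equation (12.13)
t_hac = beta1_hat / robust_se

print(f"SC-robust se      = {robust_se:.4f}")
print(f"ratio to usual se = {robust_se / usual_se:.3f}")
print(f"HAC t statistic   = {t_hac:.2f}")
```

The results agree with the reported standard error of .0426, the roughly 6% increase over the usual standard error, and the t statistic of about −4.98.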
Kiefer and Vogelsang (2005) provide a different way to obtain valid inference in the presence 
of arbitrary serial correlation. Rather than worry about the rate at which $g$ is allowed to grow (as a 
function of $n$) in order for the t statistics to have asymptotic standard normal distributions, Kiefer 
and Vogelsang derive the large-sample distribution of the t statistic when $b = (g + 1)/n$ is allowed 
to settle down to a nonzero fraction. [In the Newey-West setup, $(g + 1)/n$ always converges to zero.] 
For example, when $b = 1$, $g = n - 1$, which means that we include every covariance term in equation 
(12.12). The resulting t statistic does not have a large-sample standard normal distribution, but 
Kiefer and Vogelsang show that it does have an asymptotic distribution, and they tabulate the appropriate 
critical values. For a two-sided, 5% level test, the critical value is 4.771, and for a two-sided 
10% level test, the critical value is 3.764. Compared with the critical values from the standard normal 
distribution, we need a t statistic substantially larger. But we do not have to worry about choosing the 
number of covariances in (12.12). 

Before leaving this section, we note that it is possible to construct SC-robust (HAC) F-type sta- 
tistics for testing multiple hypotheses, but these are too advanced to cover here. [See Wooldridge 
(1991b, 1995) and Davidson and MacKinnon (1993) for treatments.] Many econometric packages 
compute such statistics routinely after OLS regression, but, of course, we need to specify a truncation 
lag in the Newey-West estimator. 


12-3 Testing for Serial Correlation 


The methods of the previous section show that we can obtain standard errors for the OLS esti- 
mators that are valid in the presence of general forms of serial correlation (and heteroskedastic- 
ity). Consequently, in principle there may be no reason to go beyond using OLS and computing 
so-called HAC standard errors and test statistics. Nevertheless, there are a few reasons one might 
want to test for the presence of serial correlation in the errors. First, using a HAC estimator requires 
us to choose a bandwidth, and different choices made by different researchers lead to different 
standard errors—even if there is no serial correlation. Therefore, if one cannot detect fairly simple 
forms of serial correlation via testing, it may be prudent not to adjust the standard errors for serial 
correlation. 

A second reason for testing for serial correlation is that we may be able to obtain more efficient 
estimators, at least if the explanatory variables are strictly exogenous. Without fairly strong evidence 
of serial correlation it would be pointless to pursue estimation strategies that improve over OLS only 
in the presence of serial correlation. The decision is essentially the same as moving from OLS to 
weighted least squares in a cross-sectional context: we would first want evidence of heteroskedastic- 
ity before implementing WLS. We discuss how generalized least squares can be used to account for 
serial correlation in Section 12-4. 

Finally, one might have specified a model such that the errors, ideally, should not have serial cor- 
relation. This is especially true when the goal is forecasting and the model includes lagged depend- 
ent variables. We can use the presence of serial correlation as a simple diagnostic to indicate that 
our model is missing lags of at least some of the variables, and more attention is needed to model 
specification. 

In this section we study the problem of testing for serial correlation in the errors $\{u_t: t = 1, 2, \ldots\}$ 
in the standard linear model 


$y_t = \beta_0 + \beta_1 x_{t1} + \cdots + \beta_k x_{tk} + u_t$. 


We first consider the case in which the regressors are strictly exogenous. Recall that this requires the 
error, $u_t$, to be uncorrelated with the regressors in all time periods (see Section 10-3), so, among other 
things, it rules out models with lagged dependent variables. 


12-3a A t Test for AR(1) Serial Correlation with Strictly Exogenous Regressors 


Although there are numerous ways in which the error terms in a multiple regression model can be 
serially correlated, the most popular model—and the simplest to work with—is the AR(1) model in 
equations (12.1) and (12.2). In Section 12-1, we explained the implications of performing OLS when 
the errors are serially correlated in general, and we derived the variance of the OLS slope estimator 
in a simple regression model with AR(1) errors. In Section 12-2, we showed how to modify the OLS 
standard errors to allow for general serial correlation and heteroskedasticity. We now show how to test 
for the presence of AR(1) serial correlation. The null hypothesis is that there is no serial correlation. 
Therefore, just as with tests for heteroskedasticity, we assume the best and require the data to provide 
reasonably strong evidence that the ideal assumption of no serial correlation is violated. 

We first derive a large-sample test under the assumption that the explanatory variables are strictly 
exogenous: the expected value of $u_t$, given the entire history of independent variables, is zero. In addition, 
in (12.1), we must assume that 


$E(e_t \mid u_{t-1}, u_{t-2}, \ldots) = 0$ [12.16] 
and 
$\mathrm{Var}(e_t \mid u_{t-1}) = \mathrm{Var}(e_t) = \sigma_e^2$. [12.17] 


These are standard assumptions in the AR(1) model (which follow when $\{e_t\}$ is an i.i.d. sequence), 
and they allow us to apply the large-sample results from Chapter 11 for dynamic regression. 

As with testing for heteroskedasticity, the null hypothesis is that the appropriate Gauss-Markov 
assumption is true. In the AR(1) model, the null hypothesis that the errors are serially uncorrelated is 


$H_0: \rho = 0$. [12.18] 


How can we test this hypothesis? If the $u_t$ were observed, then, under (12.16) and (12.17), we could 
immediately apply the asymptotic normality results from Theorem 11.2 to the dynamic regression 
model 


$u_t = \rho u_{t-1} + e_t, \quad t = 2, \ldots, n$. [12.19] 


(Under the null hypothesis $\rho = 0$, $\{u_t\}$ is clearly weakly dependent.) In other words, we could estimate 
$\rho$ from the regression of $u_t$ on $u_{t-1}$, for all $t = 2, \ldots, n$, without an intercept, and use the usual t statistic 
for $\hat{\rho}$. This does not work because the errors $u_t$ are not observed. Nevertheless, just as with testing for 
heteroskedasticity, we can replace $u_t$ with the corresponding OLS residual, $\hat{u}_t$. Because $\hat{u}_t$ depends on the 
OLS estimators $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_k$, it is not obvious that using $\hat{u}_t$ for $u_t$ in the regression has no effect on the 
distribution of the t statistic. Fortunately, it turns out that, because of the strict exogeneity assumption, 
the large-sample distribution of the t statistic is not affected by using the OLS residuals in place of the 
errors. A proof is well beyond the scope of this text, but it follows from the work of Wooldridge (1991b). 
We can summarize the asymptotic test for AR(1) serial correlation very simply. 


Testing for AR(1) Serial Correlation with Strictly Exogenous Regressors: 


(i) Run the OLS regression of $y_t$ on $x_{t1}, \ldots, x_{tk}$ and obtain the OLS residuals, $\hat{u}_t$, for all 
$t = 1, 2, \ldots, n$. 
(ii) Run the regression of 


$\hat{u}_t$ on $\hat{u}_{t-1}$, for all $t = 2, \ldots, n$, [12.20] 


obtaining the coefficient $\hat{\rho}$ on $\hat{u}_{t-1}$ and its t statistic, $t_{\hat{\rho}}$. (This regression may or may not contain 
an intercept; the t statistic for $\hat{\rho}$ will be slightly affected, but it is asymptotically valid either way.) 
(iii) Use $t_{\hat{\rho}}$ to test $H_0: \rho = 0$ against $H_1: \rho \neq 0$ in the usual way. (Actually, because $\rho > 0$ is often 
expected a priori, the alternative can be $H_1: \rho > 0$.) Typically, we conclude that serial correlation 
is a problem to be dealt with only if $H_0$ is rejected at the 5% level. As always, it is best to 
report the p-value for the test. 
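
Step (ii) amounts to a regression through the origin, so it can be sketched with a few sums. In the Python example below, the function name is our own and the residual series is simulated as a stand-in; in practice the $\hat{u}_t$ would come from step (i).

```python
import math
import random


def ar1_t_test(u_hat):
    """Regress u_hat[t] on u_hat[t-1] (no intercept); return (rho_hat, t_stat)."""
    cur, lag = u_hat[1:], u_hat[:-1]
    sxx = sum(v * v for v in lag)
    rho_hat = sum(a * b for a, b in zip(cur, lag)) / sxx
    resid = [a - rho_hat * b for a, b in zip(cur, lag)]
    # Usual OLS error variance with one estimated parameter: SSR / (m - 1)
    s2 = sum(e * e for e in resid) / (len(cur) - 1)
    se = math.sqrt(s2 / sxx)
    return rho_hat, rho_hat / se


# Stand-in for OLS residuals: a simulated AR(1) series with rho = 0.6
rng = random.Random(3)
u_hat = [0.0]
for _ in range(499):
    u_hat.append(0.6 * u_hat[-1] + rng.gauss(0.0, 1.0))

rho_hat, t_stat = ar1_t_test(u_hat)
print(f"rho_hat = {rho_hat:.3f}, t statistic = {t_stat:.2f}")
```

The t statistic would then be compared with standard normal critical values, as in step (iii).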


In deciding whether serial correlation needs to be addressed, we should remember the difference 
between practical and statistical significance. With a large sample size, it is possible to find 
serial correlation even though $\hat{\rho}$ is practically small; when $\hat{\rho}$ is close to zero, the usual OLS inference 
procedures will not be far off [see equation (12.4)]. Such outcomes are somewhat rare in time series 
applications because time series data sets are usually small. 


EXAMPLE 12.2 Testing for AR(1) Serial Correlation in the Phillips Curve 


In Chapter 10, we estimated a static Phillips curve that explained the inflation-unemployment tradeoff 
in the United States (see Example 10.1). In Chapter 11, we studied a particular expectations-augmented 
Phillips curve, where we assumed adaptive expectations (see Example 11.5). We now test 
the error term in each equation for serial correlation. Because the expectations-augmented curve uses 
$\Delta inf_t = inf_t - inf_{t-1}$ as the dependent variable, we have one fewer observation. 

For the static Phillips curve estimated using the data through 2006, the regression in (12.20) 
yields $\hat{\rho} = .571$, $t = 5.48$, and p-value = .000 ($n = 58$ observations, with the first lost due to lagging 
the residual). This is very strong evidence of positive, first-order serial correlation. One consequence 
of this finding is that the standard errors and t statistics from Chapter 10 are not valid, and we should 
compute a HAC standard error for the slope. By contrast, the test for AR(1) serial correlation in 
the expectations-augmented Phillips curve gives $\hat{\rho} = -.033$, $t = -.29$, and p-value = .773 (with 57 
observations): there is no evidence of AR(1) serial correlation in the expectations-augmented Phillips 
curve. We also prefer the expectations-augmented version because the estimated tradeoff (negative 
slope) makes more sense in terms of economic theory. 


Although the test from (12.20) is derived from the AR(1) model, the test can detect other kinds of 
serial correlation. Remember, $\hat{\rho}$ is a consistent estimator of the correlation between $u_t$ and $u_{t-1}$. Any 
serial correlation that causes adjacent errors to be correlated can be picked up by this test. On the other 
hand, it does not detect serial correlation where adjacent errors are uncorrelated, $\mathrm{Corr}(u_t, u_{t-1}) = 0$. 
(For example, $u_t$ and $u_{t-2}$ could be correlated.) 

GOING FURTHER 12.2 
How would you use regression (12.20) to construct an approximate 95% confidence interval for $\rho$? 

In using the usual t statistic from (12.20), we must assume that the errors in (12.19) satisfy 
the appropriate homoskedasticity assumption, (12.17). In fact, it is easy to make the test 
robust to heteroskedasticity in $e_t$: we simply use the usual, heteroskedasticity-robust t statistic 
from Chapter 8. For the static Phillips curve in Example 12.2, the heteroskedasticity-robust t statistic 
is 3.98, which is smaller than the nonrobust t statistic but still very significant. In Section 12-6, 
we further discuss heteroskedasticity in time series regressions, including its dynamic forms. 
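
For the regression through the origin in (12.20), the heteroskedasticity-robust t statistic has a simple form. The sketch below uses the robust variance without an intercept and without a degrees-of-freedom adjustment; both choices, along with the function name and the toy residuals, are assumptions of this example rather than the text's computation.

```python
import math


def ar1_robust_t(u_hat):
    """Heteroskedasticity-robust t statistic for rho in regression (12.20).

    Uses the robust variance sum(lag^2 * e^2) / (sum(lag^2))^2, with no
    intercept and no degrees-of-freedom adjustment.
    """
    cur, lag = u_hat[1:], u_hat[:-1]
    sxx = sum(v * v for v in lag)
    rho_hat = sum(a * b for a, b in zip(cur, lag)) / sxx
    resid = [a - rho_hat * b for a, b in zip(cur, lag)]
    robust_var = sum((b * e) ** 2 for b, e in zip(lag, resid)) / sxx ** 2
    return rho_hat, rho_hat / math.sqrt(robust_var)


u_hat = [1.0, 2.0, 1.0, 3.0]   # toy residuals, for illustration only
rho_hat, t_robust = ar1_robust_t(u_hat)
print(rho_hat, t_robust)
```

The point estimate $\hat{\rho}$ is unchanged; only the standard error, and hence the t statistic, differs from the nonrobust version.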


12-3b The Durbin-Watson Test under Classical Assumptions 


Another test for AR(1) serial correlation is the Durbin-Watson test. The Durbin-Watson (DW) 
statistic is also based on the OLS residuals: 


$DW = \dfrac{\sum_{t=2}^{n} (\hat{u}_t - \hat{u}_{t-1})^2}{\sum_{t=1}^{n} \hat{u}_t^2}$. [12.21] 

Simple algebra shows that DW and $\hat{\rho}$ from (12.20) are closely linked: 


$DW \approx 2(1 - \hat{\rho})$. [12.22] 


One reason this relationship is not exact is that $\hat{\rho}$ has $\sum_{t=2}^{n} \hat{u}_{t-1}^2$ in its denominator, while the DW 
statistic has the sum of squares of all OLS residuals in its denominator. Even with moderate sample 
sizes, the approximation in (12.22) is often pretty close. Therefore, tests based on DW and the t test 
based on $\hat{\rho}$ are conceptually the same. 
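
The closeness of the approximation in (12.22) is easy to check numerically. In this sketch the function names are our own, and the residuals are a simulated stand-in for OLS residuals.

```python
import random


def durbin_watson(u_hat):
    """DW statistic: sum of squared first differences over sum of squares."""
    num = sum((u_hat[t] - u_hat[t - 1]) ** 2 for t in range(1, len(u_hat)))
    den = sum(u * u for u in u_hat)
    return num / den


def rho_hat_from_residuals(u_hat):
    """Slope from regressing u_hat[t] on u_hat[t-1] without an intercept."""
    num = sum(u_hat[t] * u_hat[t - 1] for t in range(1, len(u_hat)))
    den = sum(u_hat[t - 1] ** 2 for t in range(1, len(u_hat)))
    return num / den


# Stand-in for OLS residuals: a simulated AR(1) series (illustrative only)
rng = random.Random(7)
u_hat = [0.0]
for _ in range(499):
    u_hat.append(0.5 * u_hat[-1] + rng.gauss(0.0, 1.0))

dw = durbin_watson(u_hat)
approx = 2.0 * (1.0 - rho_hat_from_residuals(u_hat))
print(f"DW = {dw:.3f}, 2(1 - rho_hat) = {approx:.3f}")
```

The two numbers differ only through end-of-sample terms, which become negligible as the sample size grows.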

Durbin and Watson (1950) derive the distribution of DW (conditional on X), something that 
requires the full set of classical linear model assumptions, including normality of the error terms. 
Unfortunately, this distribution depends on the values of the independent variables. (It also depends 
on the sample size, the number of regressors, and whether the regression contains an intercept.) 
Although some econometrics packages tabulate critical values and p-values for DW, many do not. In 
any case, they depend on the full set of CLM assumptions. 

Several econometrics texts report upper and lower bounds for the critical values that depend on 
the desired significance level, the alternative hypothesis, the number of observations, and the number 
of regressors. (We assume that an intercept is included in the model.) Usually, the DW test is com- 
puted for the alternative 


$$H_1: \rho > 0.$$ [12.23]


From the approximation in (12.22), $\hat{\rho} \approx 0$ implies that $DW \approx 2$, and $\hat{\rho} > 0$ implies that $DW < 2$.
Thus, to reject the null hypothesis (12.18) in favor of (12.23), we are looking for a value of DW that is
significantly less than two. Unfortunately, because of the problems in obtaining the null distribution
of DW, we must compare DW with two sets of critical values. These are usually labeled as $d_U$ (for
upper) and $d_L$ (for lower). If $DW < d_L$, then we reject $H_0$ in favor of (12.23); if $DW > d_U$, we fail to
reject $H_0$. If $d_L \le DW \le d_U$, the test is inconclusive.

As an example, if we choose a 5% significance level with n = 45 and k = 4, $d_U = 1.720$ and
$d_L = 1.336$ [see Savin and White (1977)]. If DW < 1.336, we reject the null of no serial correlation
at the 5% level; if DW > 1.72, we fail to reject $H_0$; if $1.336 \le DW \le 1.72$, the test is inconclusive.

In Example 12.2, for the static Phillips curve, DW is computed to be DW = .80. We can obtain
the lower 1% critical value from Savin and White (1977) for k = 1 and n = 50: $d_L = 1.32$. Therefore,
we reject the null of no serial correlation against the alternative of positive serial correlation at the 1%
level. (Using the previous t test, we can conclude that the p-value equals zero to three decimal places.)
For the expectations-augmented Phillips curve, DW = 1.77, which is well within the fail-to-reject
region at even the 5% level ($d_U = 1.59$).
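
The three-way decision rule for the bounds test can be written as a small function. This is only a sketch: the function name and the returned labels are ours, and the bounds used in the calls are the 5% values quoted above for n = 45 and k = 4. The DW values passed in are hypothetical.

```python
def dw_bounds_test(dw, d_lower, d_upper):
    """Bounds test of H0: rho = 0 against H1: rho > 0."""
    if dw < d_lower:
        return "reject H0"
    if dw > d_upper:
        return "fail to reject H0"
    return "inconclusive"


# 5% bounds for n = 45, k = 4 (Savin and White, 1977), as quoted in the text.
D_L, D_U = 1.336, 1.720

for dw_value in (1.20, 1.50, 1.90):  # hypothetical DW statistics
    print(dw_value, "->", dw_bounds_test(dw_value, D_L, D_U))
```

The middle case shows the practical cost of the bounds approach: a DW statistic between the two bounds yields no decision at all.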

The fact that an exact sampling distribution for DW can be tabulated is the only advantage that
DW has over the t test from (12.20). Given that the tabulated critical values are exactly valid only
under the full set of CLM assumptions and that they can lead to a wide inconclusive region, the
practical disadvantages of the DW statistic are substantial. The t statistic from (12.20) is simple to
compute and asymptotically valid without normally distributed errors. The t statistic is also valid in
the presence of heteroskedasticity that depends on the $x_{tj}$. Plus, it is easy to make it robust to any form
of heteroskedasticity.


12-3c Testing for AR(1) Serial Correlation without Strictly
Exogenous Regressors

When the explanatory variables are not strictly exogenous, so that one or more $x_{tj}$ are correlated with
$u_{t-1}$, neither the t test from regression (12.20) nor the Durbin-Watson statistic is valid, even in large
samples. The leading case of nonstrictly exogenous regressors occurs when the model contains a
lagged dependent variable: $y_{t-1}$ and $u_{t-1}$ are obviously correlated. Durbin (1970) suggested two
alternatives to the DW statistic when the model contains a lagged dependent variable and the other
regressors are nonrandom (or, more generally, strictly exogenous). The first is called Durbin’s h statistic.
This statistic has a practical drawback in that it cannot always be computed, so we do not cover
it here.

Durbin’s alternative statistic is simple to compute and is valid when there are any number of
nonstrictly exogenous explanatory variables. The test also works if the explanatory variables happen to be
strictly exogenous.


Testing for Serial Correlation with General Regressors:


(i) Run the OLS regression of $y_t$ on $x_{t1}, \ldots, x_{tk}$ and obtain the OLS residuals, $\hat{u}_t$, for all
t = 1, 2, ..., n.


(ii) Run the regression of

$$\hat{u}_t \text{ on } x_{t1}, x_{t2}, \ldots, x_{tk}, \hat{u}_{t-1}, \text{ for all } t = 2, \ldots, n$$ [12.24]

to obtain the coefficient $\hat{\rho}$ on $\hat{u}_{t-1}$ and its t statistic, $t_{\hat{\rho}}$.


(iii) Use $t_{\hat{\rho}}$ to test $H_0: \rho = 0$ against $H_1: \rho \neq 0$ in the usual way (or use a one-sided alternative).


In equation (12.24), we regress the OLS residuals on all independent variables, including an intercept,
and the lagged residual. The t statistic on the lagged residual is a valid test of (12.18) in the AR(1) model
(12.19) [when we add $\mathrm{Var}(u_t | x_t, u_{t-1}) = \sigma^2$ under $H_0$]. Any number of lagged dependent variables may
appear among the $x_{tj}$, and other nonstrictly exogenous explanatory variables are allowed as well.

The inclusion of $x_{t1}, \ldots, x_{tk}$ explicitly allows for each $x_{tj}$ to be correlated with $u_{t-1}$, and this
ensures that $t_{\hat{\rho}}$ has an approximate t distribution in large samples. The t statistic from (12.20) ignores
possible correlation between $x_{tj}$ and $u_{t-1}$, so it is not valid without strictly exogenous regressors.
Incidentally, because $\hat{u}_t = y_t - \hat{\beta}_0 - \hat{\beta}_1 x_{t1} - \cdots - \hat{\beta}_k x_{tk}$, it can be shown that the t statistic on $\hat{u}_{t-1}$ is
the same if $y_t$ is used in place of $\hat{u}_t$ as the dependent variable in (12.24).

The t statistic from (12.24) is easily made robust to heteroskedasticity of unknown form [in particular,
when $\mathrm{Var}(u_t | x_t, u_{t-1})$ is not constant]: just use the heteroskedasticity-robust t statistic on $\hat{u}_{t-1}$.
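
The three-step test with general regressors can be sketched in Python with NumPy. Everything here is illustrative: the data are simulated with an AR(1) error (true $\rho = .6$), the function names are ours, and the t statistic uses the usual (nonrobust) OLS standard error, so it relies on the homoskedasticity condition stated above.

```python
import numpy as np


def ols(y, X):
    """OLS coefficients, conventional standard errors, and residuals."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta = XtX_inv @ X.T @ y
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(sigma2 * np.diag(XtX_inv))
    return beta, se, resid


def ar1_test_general(y, X):
    """AR(1) test that keeps the regressors in the auxiliary regression.

    X is assumed to contain a column of ones. Returns (rho_hat, t statistic).
    """
    _, _, u = ols(y, X)                        # step (i): OLS residuals
    Z = np.column_stack([X[1:], u[:-1]])       # step (ii): regressors plus lagged residual
    coef, se, _ = ols(u[1:], Z)
    return coef[-1], coef[-1] / se[-1]         # step (iii): t statistic on lagged residual


# Simulated data (hypothetical): y = 1 + 2x + u with AR(1) errors.
rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
u = np.zeros(n)
for t in range(1, n):
    u[t] = 0.6 * u[t - 1] + rng.normal()
y = 1.0 + 2.0 * x + u
X = np.column_stack([np.ones(n), x])

rho_est, t_rho = ar1_test_general(y, X)
print(f"rho_hat = {rho_est:.3f}, t = {t_rho:.2f}")
```

Dropping `X[1:]` from `Z` gives the simpler test that is valid only under strict exogeneity.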


Testing for AR(1) Serial Correlation in the Minimum Wage Equation 


In Example 12.1, we computed a HAC standard error for the estimated elasticity of the employment
rate with respect to the minimum wage. We now test whether the errors in the equation exhibit AR(1)
serial correlation, using the test that does not assume strict exogeneity of the minimum wage or GNP
variables. We are assuming that the underlying stochastic processes are weakly dependent, but we
allow them to contain a linear time trend by including t in the regression.

Letting $\hat{u}_t$ denote the OLS residuals, we run the regression of


$\hat{u}_t$ on $\hat{u}_{t-1}$, log(mincov), log(usgnp), log(prgnp), t


using the 37 available observations. The estimated coefficient on $\hat{u}_{t-1}$ is $\hat{\rho} = .481$ with t = 2.89
(two-sided p-value = .007). Therefore, there is strong evidence of AR(1) serial correlation in the errors,
which means the t statistics for the $\hat{\beta}_j$ that we obtained before are not valid for inference. Remember,
though, the $\hat{\beta}_j$ are still consistent if $u_t$ is contemporaneously uncorrelated with each explanatory
variable. Incidentally, if we use regression (12.20) instead, we obtain $\hat{\rho} = .417$ and t = 2.63, so the
outcome of the test is similar in this case. Interestingly, although there is strong evidence of positive
serial correlation, the HAC standard error in Example 12.1 is only slightly larger than the usual OLS
standard error.
12-3d Testing for Higher-Order Serial Correlation


The test from (12.24) is easily extended to higher orders of serial correlation. For example, suppose
that we wish to test


$$H_0: \rho_1 = 0, \rho_2 = 0$$ [12.25]

in the AR(2) model,

$$u_t = \rho_1 u_{t-1} + \rho_2 u_{t-2} + e_t.$$


This alternative model of serial correlation allows us to test for second-order serial correlation.
As always, we estimate the model by OLS and obtain the OLS residuals, $\hat{u}_t$. Then, we can run the
regression of


$$\hat{u}_t \text{ on } x_{t1}, x_{t2}, \ldots, x_{tk}, \hat{u}_{t-1}, \text{ and } \hat{u}_{t-2}, \text{ for all } t = 3, \ldots, n$$


to obtain the F test for joint significance of $\hat{u}_{t-1}$ and $\hat{u}_{t-2}$. If these two lags are jointly significant
at a small enough level, say, 5%, then we reject (12.25) and conclude that the errors are serially
correlated.

More generally, we can test for serial correlation in the autoregressive model of order q:


$$u_t = \rho_1 u_{t-1} + \rho_2 u_{t-2} + \cdots + \rho_q u_{t-q} + e_t.$$ [12.26]

The null hypothesis is

$$H_0: \rho_1 = 0, \rho_2 = 0, \ldots, \rho_q = 0.$$ [12.27]


Testing for AR(q) Serial Correlation:


(i) Run the OLS regression of $y_t$ on $x_{t1}, \ldots, x_{tk}$ and obtain the OLS residuals, $\hat{u}_t$, for all t = 1, 2, ..., n.


(ii) Run the regression of

$$\hat{u}_t \text{ on } x_{t1}, x_{t2}, \ldots, x_{tk}, \hat{u}_{t-1}, \hat{u}_{t-2}, \ldots, \hat{u}_{t-q}, \text{ for all } t = (q + 1), \ldots, n.$$ [12.28]


(iii) Compute the F test for joint significance of $\hat{u}_{t-1}, \hat{u}_{t-2}, \ldots, \hat{u}_{t-q}$ in (12.28). [The F statistic with
$y_t$ as the dependent variable in (12.28) can also be used, as it gives an identical answer.]


If the $x_{tj}$ are assumed to be strictly exogenous, so that each $x_{tj}$ is uncorrelated with $u_{t-1}, u_{t-2}, \ldots, u_{t-q}$,
then the $x_{tj}$ can be omitted from (12.28). Including the $x_{tj}$ in the regression makes the test
valid with or without the strict exogeneity assumption. The test requires the homoskedasticity
assumption

$$\mathrm{Var}(u_t | x_t, u_{t-1}, \ldots, u_{t-q}) = \sigma^2.$$ [12.29]

A heteroskedasticity-robust version can be computed as described in Chapter 8.

An alternative to computing the F test is to use the Lagrange multiplier (LM) form of the statistic.
(We covered the LM statistic for testing exclusion restrictions in Chapter 5 for cross-sectional
analysis.) The LM statistic for testing (12.27) is simply


$$LM = (n - q) R_{\hat{u}}^2,$$ [12.30]


where $R_{\hat{u}}^2$ is just the usual R-squared from regression (12.28). Under the null hypothesis, $LM \overset{a}{\sim} \chi_q^2$.
This is usually called the Breusch-Godfrey test for AR(q) serial correlation. The LM statistic
also requires (12.29), but it can be made robust to heteroskedasticity. [For details, see Wooldridge
(1991b).]
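
The LM form of the test is short enough to compute directly. The sketch below follows steps (i) and (ii) and then forms $(n - q)R_{\hat{u}}^2$; it is an illustration on simulated data (AR(1) errors with $\rho = .6$, tested with q = 2), and the function name is ours. The only outside fact used is that the 5% critical value of the chi-square distribution with 2 degrees of freedom is about 5.99.

```python
import numpy as np


def breusch_godfrey_lm(y, X, q):
    """LM = (n - q) * R-squared from the auxiliary regression (12.28).

    X is assumed to contain a column of ones.
    """
    n = len(y)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)          # step (i)
    u = y - X @ beta
    # Column j holds u_{t-j} for t = q+1, ..., n.
    lags = np.column_stack([u[q - j:n - j] for j in range(1, q + 1)])
    Z = np.column_stack([X[q:], lags])                    # step (ii)
    dep = u[q:]
    gamma, *_ = np.linalg.lstsq(Z, dep, rcond=None)
    ssr = np.sum((dep - Z @ gamma) ** 2)
    sst = np.sum((dep - dep.mean()) ** 2)
    return (n - q) * (1.0 - ssr / sst)


# Simulated data (hypothetical): y = 1 + 2x + u with AR(1) errors.
rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=n)
u = np.zeros(n)
for t in range(1, n):
    u[t] = 0.6 * u[t - 1] + rng.normal()
y = 1.0 + 2.0 * x + u
X = np.column_stack([np.ones(n), x])

lm = breusch_godfrey_lm(y, X, q=2)
print(f"LM = {lm:.2f}; compare with the chi-square(2) 5% critical value, about 5.99")
```

This version is not robust to heteroskedasticity; it relies on (12.29).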
Testing for AR(3) Serial Correlation


In the event study of the barium chloride industry (see Example 10.5), we used monthly data, so we
may wish to test for higher orders of serial correlation. For illustration purposes, we test for AR(3)
serial correlation in the errors underlying equation (10.22). Using regression (12.28), we find the F
statistic for joint significance of $\hat{u}_{t-1}$, $\hat{u}_{t-2}$, and $\hat{u}_{t-3}$ is F = 5.12. Originally, we had n = 131, and
we lose three observations in the auxiliary regression (12.28). Because we estimate 10 parameters in
(12.28) for this example, the df in the F statistic are 3 and 118. The p-value of the F statistic is .0023,
so there is strong evidence of AR(3) serial correlation. If we were trying to publish the findings of the
estimates reported in (10.22), we should use Newey-West standard errors, probably with a truncated
lag of three or four given the sample size of 131.


With quarterly or monthly data that have not been seasonally adjusted, we sometimes wish to
test for seasonal forms of serial correlation. For example, with quarterly data, we might postulate the
autoregressive model


$$u_t = \rho_4 u_{t-4} + e_t.$$ [12.31]


From the AR(1) serial correlation tests, it is pretty clear how to proceed. When the regressors are
strictly exogenous, we can use a t test on $\hat{u}_{t-4}$ in the regression of


$$\hat{u}_t \text{ on } \hat{u}_{t-4}, \text{ for all } t = 5, \ldots, n.$$


A modification of the Durbin-Watson statistic is also available [see Wallis (1972)]. When the $x_{tj}$ are
not strictly exogenous, we can use the regression in (12.24), with $\hat{u}_{t-4}$ replacing $\hat{u}_{t-1}$.

In Example 12.4, the data are monthly and are not seasonally adjusted. Therefore, it makes sense
to test for correlation between $u_t$ and $u_{t-12}$. A regression of $\hat{u}_t$ on $\hat{u}_{t-12}$ yields $\hat{\rho}_{12} = -.187$ and
p-value = .028, so there is evidence of negative seasonal autocorrelation. (Including the regressors
changes things only modestly: $\hat{\rho}_{12} = -.170$ and p-value = .052.) This is somewhat unusual and
does not have an obvious explanation.


GOING FURTHER 12.3
Suppose you have quarterly data and you want to test for the presence of first-order or
fourth-order serial correlation. With strictly exogenous regressors, how would you proceed?


12-4 Correcting for Serial Correlation with Strictly 
Exogenous Regressors 


If we detect serial correlation after applying one of the tests in Section 12-3, we need a remedy. If 
our goal is to estimate a model with complete dynamics, we need to respecify the model, probably by 
including more lags. In applications where our goal is not to estimate a fully dynamic model, we need 
to find a way to carry out valid statistical inference. 

In Section 12-1 we saw explicitly why the usual OLS standard errors and test statistics are no 
longer valid in the presence of serial correlation. Section 12-2 shows how to construct standard errors 
that are robust to general forms of serial correlation and heteroskedasticity—so called HAC standard 
errors—using a method popularized by Newey and West (1987). As discussed in Section 12-2, using 
OLS and correcting the standard errors is attractive because we only have to maintain contemporane- 
ous exogeneity of the explanatory variables. (In a distributed lag model, contemporaneous exogene- 
ity is effectively the same as sequential exogeneity—that is, we have a sufficient number of lags to 
account for any lagged effect.) The main downside to using OLS with Newey-West standard errors, 
other than having to choose a truncation lag, is that the OLS estimators may be imprecise. In particular,
the Newey-West standard errors might be so large that confidence intervals are wide. As in any
statistical or econometric context, we might estimate a large practical effect, but with large standard
errors the estimate might not be statistically significant.

If the Newey-West standard errors for OLS seem too large to be useful, we have another option:
We can model the serial correlation and then apply a generalized least squares procedure. This is the
more traditional solution to the problem of serial correlation in $\{u_t: t = 1, 2, \ldots\}$, dating back
to Cochrane and Orcutt (1949); see also Prais and Winsten (1954). Modeling the serial correlation and
using feasible GLS is analogous to using weighted least squares after modeling heteroskedasticity in
cross-sectional regression.

The original studies of the statistical properties of GLS applied to the problem of serial correlation
assumed that the regressors were nonrandom, or fixed in repeated samples. We now know that the 
key requirement for consistency of GLS-based serial correlation corrections is that the regressors are 
strictly exogenous. (We know strict exogeneity is needed for OLS to be unbiased, but it is not neces- 
sary for consistency.) Therefore, one should be confident that the explanatory variables satisfy at least 
some form of strict exogeneity, to be made precise in what follows, before applying GLS methods. 
We certainly should not use GLS methods to estimate models with lagged dependent variables, but 
we also know there are other interesting cases in which strict exogeneity fails. The fact that GLS 
requires stronger exogeneity assumptions partly explains why using OLS with HAC standard errors 
has become increasingly popular. The potential gain from assuming strict exogeneity and using GLS 
is that we can obtain estimators that are more (asymptotically) efficient than OLS. Also, if we are suc- 
cessful in modeling the serial correlation, inference is simplified. 

We begin our treatment with the most commonly used model of serial correlation, the AR(1) model. 


12-4a Obtaining the Best Linear Unbiased 
Estimator in the AR(1) Model 


We start by assuming the first four Gauss-Markov assumptions hold, TS.1 through TS.4, but we relax 
Assumption TS.5. In particular, we assume that the errors follow the AR(1) model 


$$u_t = \rho u_{t-1} + e_t, \text{ for all } t = 1, 2, \ldots.$$ [12.32]


Remember that Assumption TS.3 implies that $u_t$ has a zero mean conditional on X. In the following
analysis, we let the conditioning on X be implied in order to simplify the notation. Thus, we write the
variance of $u_t$ as


$$\mathrm{Var}(u_t) = \sigma_e^2/(1 - \rho^2).$$ [12.33]

For simplicity, consider the case with a single explanatory variable:

$$y_t = \beta_0 + \beta_1 x_t + u_t, \text{ for all } t = 1, 2, \ldots, n.$$


Because the problem in this equation is serial correlation in the $u_t$, it makes sense to transform the
equation to eliminate the serial correlation. For $t \ge 2$, we write


$$y_{t-1} = \beta_0 + \beta_1 x_{t-1} + u_{t-1}$$
$$y_t = \beta_0 + \beta_1 x_t + u_t.$$


Now, if we multiply the first equation by $\rho$ and subtract it from the second equation, we get

$$y_t - \rho y_{t-1} = (1 - \rho)\beta_0 + \beta_1 (x_t - \rho x_{t-1}) + e_t, \quad t \ge 2,$$

where we have used the fact that $e_t = u_t - \rho u_{t-1}$. We can write this as


$$\tilde{y}_t = (1 - \rho)\beta_0 + \beta_1 \tilde{x}_t + e_t, \quad t \ge 2,$$ [12.34]

where


$$\tilde{y}_t = y_t - \rho y_{t-1}, \quad \tilde{x}_t = x_t - \rho x_{t-1}$$ [12.35]


are called the quasi-differenced data. (If $\rho = 1$, these are differenced data, but remember we are
assuming $|\rho| < 1$.) The error terms in (12.34) are serially uncorrelated; in fact, this equation satisfies
all of the Gauss-Markov assumptions. This means that, if we knew $\rho$, we could estimate $\beta_0$ and $\beta_1$ by
regressing $\tilde{y}_t$ on $\tilde{x}_t$, provided we divide the estimated intercept by $(1 - \rho)$.

The OLS estimators from (12.34) are not quite BLUE because they do not use the first time
period. This is easily fixed by writing the equation for t = 1 as


$$y_1 = \beta_0 + \beta_1 x_1 + u_1.$$ [12.36]


Because each $e_t$ is uncorrelated with $u_1$, we can add (12.36) to (12.34) and still have serially
uncorrelated errors. However, using (12.33), $\mathrm{Var}(u_1) = \sigma_e^2/(1 - \rho^2) > \sigma_e^2 = \mathrm{Var}(e_t)$. [Equation (12.33)
clearly does not hold when $|\rho| \ge 1$, which is why we assume the stability condition.] Thus, we must
multiply (12.36) by $(1 - \rho^2)^{1/2}$ to get errors with the same variance:


$$(1 - \rho^2)^{1/2} y_1 = (1 - \rho^2)^{1/2} \beta_0 + \beta_1 (1 - \rho^2)^{1/2} x_1 + (1 - \rho^2)^{1/2} u_1$$

or


$$\tilde{y}_1 = (1 - \rho^2)^{1/2} \beta_0 + \beta_1 \tilde{x}_1 + \tilde{u}_1,$$ [12.37]


where $\tilde{u}_1 = (1 - \rho^2)^{1/2} u_1$, $\tilde{y}_1 = (1 - \rho^2)^{1/2} y_1$, and so on. The error in (12.37) has variance
$\mathrm{Var}(\tilde{u}_1) = (1 - \rho^2)\mathrm{Var}(u_1) = \sigma_e^2$, so we can use (12.37) along with (12.34) in an OLS regression.
This gives the BLUE estimators of $\beta_0$ and $\beta_1$ under Assumptions TS.1 through TS.4 and the AR(1)
model for $u_t$. This is another example of a generalized least squares (or GLS) estimator. We saw other
GLS estimators in the context of heteroskedasticity in Chapter 8.


Adding more regressors changes very little. For $t \ge 2$, we use the equation


$$\tilde{y}_t = (1 - \rho)\beta_0 + \beta_1 \tilde{x}_{t1} + \cdots + \beta_k \tilde{x}_{tk} + e_t,$$ [12.38]


where $\tilde{x}_{tj} = x_{tj} - \rho x_{t-1,j}$. For t = 1, we have $\tilde{y}_1 = (1 - \rho^2)^{1/2} y_1$, $\tilde{x}_{1j} = (1 - \rho^2)^{1/2} x_{1j}$, and the
intercept is $(1 - \rho^2)^{1/2} \beta_0$. For given $\rho$, it is fairly easy to transform the data and to carry out OLS. Unless
$\rho = 0$, the GLS estimator, that is, OLS on the transformed data, will generally be different from
the original OLS estimator. The GLS estimator turns out to be BLUE, and, because the errors in
the transformed equation are serially uncorrelated and homoskedastic, t and F statistics from the
transformed equation are valid (at least asymptotically, and exactly if the errors $e_t$ are normally
distributed).
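
For a known $\rho$, the transformation in (12.37) and (12.38) takes only a few lines of Python with NumPy. The sketch below is illustrative and the function names are ours. It transforms the column of ones along with the other regressors, so the coefficient on the transformed intercept column estimates $\beta_0$ directly and no division by $(1 - \rho)$ is needed. The small data set is hypothetical.

```python
import numpy as np


def quasi_difference(y, X, rho):
    """Quasi-difference y and every column of X, rescaling the first observation."""
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float)
    y_t = np.empty_like(y)
    X_t = np.empty_like(X)
    scale = np.sqrt(1.0 - rho ** 2)          # requires |rho| < 1
    y_t[0] = scale * y[0]                    # t = 1, as in (12.37)
    X_t[0] = scale * X[0]
    y_t[1:] = y[1:] - rho * y[:-1]           # t >= 2, as in (12.38)
    X_t[1:] = X[1:] - rho * X[:-1]
    return y_t, X_t


def gls_ar1(y, X, rho):
    """GLS with AR(1) errors and known rho: OLS on the transformed data."""
    y_t, X_t = quasi_difference(y, X, rho)
    beta, *_ = np.linalg.lstsq(X_t, y_t, rcond=None)
    return beta


# Hypothetical data; X includes a column of ones for the intercept.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([3.1, 4.8, 7.2, 9.1, 10.8, 13.3])
X = np.column_stack([np.ones_like(x), x])

print("GLS with rho = 0.5:", gls_ar1(y, X, 0.5))
print("GLS with rho = 0.0:", gls_ar1(y, X, 0.0))  # identical to OLS
```

With `rho = 0` the transformation leaves the data unchanged, which is the sense in which GLS reduces to OLS when there is no serial correlation.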


12-4b Feasible GLS Estimation with AR(1) Errors 


The problem with the GLS estimator is that $\rho$ is rarely known in practice. However, we already know
how to get a consistent estimator of $\rho$: we simply regress the OLS residuals on their lagged
counterparts, exactly as in equation (12.20). Next, we use this estimate, $\hat{\rho}$, in place of $\rho$ to obtain the
quasi-differenced variables. We then use OLS on the equation


$$\tilde{y}_t = \beta_0 \tilde{x}_{t0} + \beta_1 \tilde{x}_{t1} + \cdots + \beta_k \tilde{x}_{tk} + error_t,$$ [12.39]


where $\tilde{x}_{t0} = (1 - \hat{\rho})$ for $t \ge 2$, and $\tilde{x}_{10} = (1 - \hat{\rho}^2)^{1/2}$. This results in the feasible GLS (FGLS)
estimator of the $\beta_j$. The error term in (12.39) contains $e_t$ and also the terms involving the estimation error
in $\hat{\rho}$. Fortunately, the estimation error in $\hat{\rho}$ does not affect the asymptotic distribution of the FGLS
estimators.
Feasible GLS Estimation of the AR(1) Model:


(i) Run the OLS regression of $y_t$ on $x_{t1}, \ldots, x_{tk}$ and obtain the OLS residuals, $\hat{u}_t$, t = 1, 2, ..., n.
(ii) Run the regression in equation (12.20) and obtain $\hat{\rho}$.
(iii) Apply OLS to equation (12.39) to estimate $\beta_0, \beta_1, \ldots, \beta_k$. The usual standard errors,
t statistics, and F statistics are asymptotically valid.
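
The three steps translate almost line for line into code. The Python sketch below uses NumPy on simulated data (AR(1) errors with $\rho = .7$); the function name and the `use_first_obs` switch are ours, and we assume the regression in step (ii) is run without an intercept. Setting `use_first_obs=False` simply drops t = 1 rather than rescaling it.

```python
import numpy as np


def fgls_ar1(y, X, use_first_obs=True):
    """One-step feasible GLS for AR(1) errors. X must include a column of ones."""
    b_ols, *_ = np.linalg.lstsq(X, y, rcond=None)         # step (i)
    u = y - X @ b_ols
    rho = (u[1:] @ u[:-1]) / (u[:-1] @ u[:-1])            # step (ii)
    y_t = y[1:] - rho * y[:-1]                            # quasi-differenced data, t >= 2
    X_t = X[1:] - rho * X[:-1]
    if use_first_obs:
        s = np.sqrt(1.0 - rho ** 2)                       # assumes |rho| < 1
        y_t = np.concatenate([[s * y[0]], y_t])
        X_t = np.vstack([s * X[0], X_t])
    b_fgls, *_ = np.linalg.lstsq(X_t, y_t, rcond=None)    # step (iii)
    return b_fgls, rho


# Simulated data (hypothetical): y = 1 + 2x + u with AR(1) errors.
rng = np.random.default_rng(3)
n = 300
x = rng.normal(size=n)
u = np.zeros(n)
for t in range(1, n):
    u[t] = 0.7 * u[t - 1] + rng.normal()
y = 1.0 + 2.0 * x + u
X = np.column_stack([np.ones(n), x])

b_with_first, rho_est = fgls_ar1(y, X, use_first_obs=True)
b_without_first, _ = fgls_ar1(y, X, use_first_obs=False)
print("rho_hat:", round(rho_est, 3))
print("slope, first observation kept:   ", round(b_with_first[1], 3))
print("slope, first observation dropped:", round(b_without_first[1], 3))
```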


The cost of using $\hat{\rho}$ in place of $\rho$ is that the FGLS estimator has no tractable finite sample properties.
In particular, it is not unbiased, although it is consistent when the data are weakly dependent. Further,
even if $e_t$ in (12.38) is normally distributed, the t and F statistics are only approximately t and F
distributed because of the estimation error in $\hat{\rho}$. This is fine for most purposes, although we must be
careful with small sample sizes.

Because the FGLS estimator is not unbiased, we certainly cannot say it is BLUE. Nevertheless, it 
is asymptotically more efficient than the OLS estimator when the AR(1) model for serial correlation 
holds (and the explanatory variables are strictly exogenous). Again, this statement assumes that the 
time series are weakly dependent. 

There are several names for FGLS estimation of the AR(1) model that come from different methods of
estimating $\rho$ and different treatment of the first observation. Cochrane-Orcutt (CO) estimation omits the
first observation and uses $\hat{\rho}$ from (12.20), whereas Prais-Winsten (PW) estimation uses the first
observation in the previously suggested way. Asymptotically, it makes no difference whether or not the first
observation is used, but many time series samples are small, so the differences can be notable in applications.

In practice, both the Cochrane-Orcutt and Prais-Winsten methods are used in an iterative scheme.
That is, once the FGLS estimator is found using $\hat{\rho}$ from (12.20), we can compute a new set of residuals,
obtain a new estimator of $\rho$ from (12.20), transform the data using the new estimate of $\rho$, and
estimate (12.39) by OLS. We can repeat the whole process many times, until the estimate of $\rho$ changes
by very little from the previous iteration. Many regression packages implement an iterative procedure
automatically, so there is no additional work for us. It is difficult to say whether more than one
iteration helps. It seems to be helpful in some cases, but, theoretically, the large-sample properties of the
iterated estimator are the same as the estimator that uses only the first iteration. For details on these
and other methods, see Davidson and MacKinnon (1993, Chapter 10).
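
The iterative scheme can be sketched as a loop that alternates between estimating $\rho$ from the current residuals and re-estimating the coefficients on the quasi-differenced data. The Python code below is a minimal illustration of iterated Prais-Winsten on simulated data; the function name, the tolerance, and the iteration cap are our own choices and not those of any particular package.

```python
import numpy as np


def iterated_prais_winsten(y, X, tol=1e-6, max_iter=100):
    """Iterate FGLS until rho_hat changes by less than tol. X includes ones."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)     # start from OLS
    rho_old = 0.0
    for n_iter in range(1, max_iter + 1):
        u = y - X @ b                             # residuals in the original equation
        rho = (u[1:] @ u[:-1]) / (u[:-1] @ u[:-1])
        s = np.sqrt(1.0 - rho ** 2)               # assumes |rho| < 1
        y_t = np.concatenate([[s * y[0]], y[1:] - rho * y[:-1]])
        X_t = np.vstack([s * X[0], X[1:] - rho * X[:-1]])
        b, *_ = np.linalg.lstsq(X_t, y_t, rcond=None)
        if abs(rho - rho_old) < tol:
            break
        rho_old = rho
    return b, rho, n_iter


# Simulated data (hypothetical): y = 1 + 2x + u with AR(1) errors, rho = 0.7.
rng = np.random.default_rng(4)
n = 300
x = rng.normal(size=n)
u = np.zeros(n)
for t in range(1, n):
    u[t] = 0.7 * u[t - 1] + rng.normal()
y = 1.0 + 2.0 * x + u
X = np.column_stack([np.ones(n), x])

b_pw, rho_pw, iterations = iterated_prais_winsten(y, X)
print(f"rho_hat = {rho_pw:.3f}, slope = {b_pw[1]:.3f}, iterations = {iterations}")
```

Note that the residuals used to update $\hat{\rho}$ come from the original, untransformed equation evaluated at the latest coefficient estimates.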


Prais-Winsten Estimation in the Event Study 


Again using the data in BARIUM, we estimate the equation in Example 10.5 using iterated Prais- 
Winsten estimation. For comparison, we also present the OLS results in Table 12.1. 

The coefficients that are statistically significant in the Prais-Winsten estimation do not differ by 
much from the OLS estimates [in particular, the coefficients on log(chempi), log(rtwex), and afdec6]. 
It is not surprising for statistically insignificant coefficients to change, perhaps markedly, across dif- 
ferent estimation methods. 

Notice how the standard errors in the second column are uniformly higher than the standard errors
in column (1). This is common. The Prais-Winsten standard errors at least account for AR(1) serial
correlation; the OLS standard errors do not. As we saw in Section 12-1, the OLS standard errors
usually understate the actual sampling variation in the OLS estimates and should not be relied upon when
significant serial correlation is present. Therefore, the effect on Chinese imports after the International
Trade Commission’s decision is now less statistically significant than we thought ($t_{afdec6} = -1.69$).
Computer Exercise C15 at the end of this chapter asks you to compute Newey-West standard errors
for OLS. In a serious study, these would be reported at least along with the usual OLS standard errors,
if not in place of them.

Finally, an R-squared is reported for the PW estimation that is well below the R-squared for
the OLS estimation in this case. However, these R-squareds should not be compared. For OLS, the
R-squared, as usual, is based on the regression with the untransformed dependent and independent
variables. For PW, the R-squared comes from the final regression of the transformed dependent
variable on the transformed independent variables. It is not clear what this $R^2$ is actually measuring;
nevertheless, it is traditionally reported.


TABLE 12.1 Dependent Variable: log(chnimp)


Coefficient       OLS          Prais-Winsten
log(chempi)       3.12         2.94
                  (0.48)       (0.63)
log(gas)          .196         1.05
                  (.907)       (0.98)
log(rtwex)        .983         1.13
                  (.400)       (0.51)
befile6           .060         -.016
                  (.261)       (.322)
affile6           -.032        -.033
                  (.264)       (.322)
afdec6            -.565        -.577
                  (.286)       (.342)
intercept         -17.80       -37.08
                  (21.05)      (22.78)
$\hat{\rho}$                   .293
Observations      131          131
R-squared         .305         .202

12-4c Comparing OLS and FGLS 


In some applications of the Cochrane-Orcutt or Prais-Winsten methods, the FGLS estimates differ 
in practically important ways from the OLS estimates. (This was not the case in Example 12.5.) 
Typically, this has been interpreted as a verification of FGLS’s superiority over OLS. Unfortunately, 
things are not so simple. To see why, consider the regression model 


$$y_t = \beta_0 + \beta_1 x_t + u_t,$$


where the time series processes are stationary. Now, assuming that the law of large numbers holds,
consistency of OLS for $\beta_1$ holds if


$$\mathrm{Cov}(x_t, u_t) = 0.$$ [12.40]


Earlier, we asserted that FGLS was consistent under the strict exogeneity assumption, which is more
restrictive than (12.40). In fact, it can be shown that the weakest assumption that must hold for FGLS
to be consistent, in addition to (12.40), is that the sum of $x_{t-1}$ and $x_{t+1}$ is uncorrelated with $u_t$:


$$\mathrm{Cov}[(x_{t-1} + x_{t+1}), u_t] = 0.$$ [12.41]


Practically speaking, consistency of FGLS requires $u_t$ to be uncorrelated with $x_{t-1}$, $x_t$, and $x_{t+1}$.

How can we show that condition (12.41) is needed along with (12.40)? The argument is simple
if we assume $\rho$ is known and drop the first time period, as in Cochrane-Orcutt. The argument when
we use $\hat{\rho}$ is technically harder and yields no additional insights. Because one observation cannot
affect the asymptotic properties of an estimator, dropping it does not affect the argument. Now, with
known $\rho$, the GLS estimator uses $x_t - \rho x_{t-1}$ as the regressor in an equation where $u_t - \rho u_{t-1}$ is the
error. From Theorem 11.1, we know the key condition for consistency of OLS is that the error and the
regressor are uncorrelated. In this case, we need $E[(x_t - \rho x_{t-1})(u_t - \rho u_{t-1})] = 0$. If we expand the
expectation, we get


$$E[(x_t - \rho x_{t-1})(u_t - \rho u_{t-1})] = E(x_t u_t) - \rho E(x_{t-1} u_t) - \rho E(x_t u_{t-1}) + \rho^2 E(x_{t-1} u_{t-1})$$
$$= -\rho [E(x_{t-1} u_t) + E(x_t u_{t-1})]$$


because $E(x_t u_t) = E(x_{t-1} u_{t-1}) = 0$ by assumption (12.40). Now, under stationarity, $E(x_t u_{t-1}) =
E(x_{t+1} u_t)$ because we are just shifting the time index one period forward. Therefore,


$$E(x_{t-1} u_t) + E(x_t u_{t-1}) = E[(x_{t-1} + x_{t+1}) u_t],$$


and the last expectation is the covariance in equation (12.41) because $E(u_t) = 0$. We have shown that
(12.41) is necessary along with (12.40) for GLS to be consistent for $\beta_1$. [Of course, if $\rho = 0$, we do
not need (12.41) because we are back to doing OLS.]

Our derivation shows that OLS and FGLS might give significantly different estimates because
(12.41) fails. In this case, OLS, which is still consistent under (12.40), is preferred to FGLS (which
is inconsistent). If $x_t$ has a lagged effect on $y_t$, or $x_{t+1}$ reacts to changes in $u_t$, FGLS can produce
misleading results.

Because OLS and FGLS are different estimation procedures, we never expect them to give the
same estimates. If they provide similar estimates of the $\beta_j$, then FGLS is preferred if there is
evidence of serial correlation because the estimator is more efficient and the FGLS test statistics are
at least asymptotically valid. A more difficult problem arises when there are practical differences in
the OLS and FGLS estimates: it is hard to determine whether such differences are statistically
significant. The general method proposed by Hausman (1978) can be used, but it is beyond the scope
of this text.

The next example gives a case where OLS and FGLS are different in practically important ways. 


Static Phillips Curve 


Table 12.2 presents OLS and iterated Prais-Winsten estimates of the static Phillips curve from 
Example 10.1, using the observations through 2006. 


TABLE 12.2 Dependent Variable: inf


Coefficient       OLS          Prais-Winsten
unem              .468         -.716
                  (.289)       (.313)
intercept         1.424        8.296
                  (1.719)      (2.231)
$\hat{\rho}$                   .781
Observations      49           49
R-squared         .053         .136


The coefficient of interest is on unem, and it differs markedly between PW and OLS. Because the PW
estimate is consistent with the inflation-unemployment tradeoff, our tendency is to focus on the PW
estimates. In fact, these estimates are fairly close to what is obtained by first differencing both inf and
unem (see Computer Exercise C4 in Chapter 11), which makes sense because the quasi-differencing
used in PW with $\hat{\rho} = .781$ is similar to first differencing. It may just be that inf and unem are not
related in levels, but they have a negative relationship in first differences.
Examples like the static Phillips curve can pose difficult problems for empirical researchers. On
the one hand, if we are truly interested in a static relationship, and if unemployment and inflation 
are I(0) processes, then OLS produces consistent estimators without additional assumptions. But it 
could be that unemployment, inflation, or both have unit roots, in which case OLS need not have its 
usual desirable properties; we discuss this further in Chapter 18. In Example 12.6, FGLS gives more 
economically sensible estimates; because it is similar to first differencing, FGLS has the advantage of 
(approximately) eliminating unit roots. 


12-4d Correcting for Higher-Order Serial Correlation 


It is also possible to correct for higher orders of serial correlation. A general treatment is given in 
Harvey (1990). Here, we illustrate the approach for AR(2) serial correlation: 


$$u_t = \rho_1 u_{t-1} + \rho_2 u_{t-2} + e_t,$$

where $\{e_t\}$ satisfies the assumptions stated for the AR(1) model. The stability conditions are more
complicated now. They can be shown to be [see Harvey (1990)]

$$\rho_2 > -1, \quad \rho_2 - \rho_1 < 1, \quad \text{and} \quad \rho_1 + \rho_2 < 1.$$


For example, the model is stable if $\rho_1 = .8$ and $\rho_2 = -.3$; the model is unstable if $\rho_1 = .7$ and $\rho_2 = .4$.
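
The three stability conditions are easy to check mechanically. The short Python function below is a sketch (the name is ours) that applies them to the two parameter pairs just mentioned.

```python
def ar2_is_stable(rho1, rho2):
    """Check the AR(2) stability conditions stated in the text."""
    return rho2 > -1 and (rho2 - rho1) < 1 and (rho1 + rho2) < 1


stable_case = ar2_is_stable(0.8, -0.3)    # all three conditions hold
unstable_case = ar2_is_stable(0.7, 0.4)   # rho1 + rho2 = 1.1 violates the third condition
print(stable_case, unstable_case)
```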
Assuming the stability conditions hold, we can obtain the transformation that eliminates the
serial correlation. In the simple regression model, this is easy when t > 2:


$$y_t - \rho_1 y_{t-1} - \rho_2 y_{t-2} = \beta_0 (1 - \rho_1 - \rho_2) + \beta_1 (x_t - \rho_1 x_{t-1} - \rho_2 x_{t-2}) + e_t$$

or

$$\tilde{y}_t = \beta_0 (1 - \rho_1 - \rho_2) + \beta_1 \tilde{x}_t + e_t, \quad t = 3, 4, \ldots, n.$$ [12.42]


If we know $\rho_1$ and $\rho_2$, we can easily estimate this equation by OLS after obtaining the transformed
variables. Because we rarely know $\rho_1$ and $\rho_2$, we have to estimate them. As usual, we can use the OLS
residuals, $\hat{u}_t$: obtain $\hat{\rho}_1$ and $\hat{\rho}_2$ from the regression of


$$\hat{u}_t \text{ on } \hat{u}_{t-1}, \hat{u}_{t-2}, \quad t = 3, \ldots, n.$$


[This is the same regression used to test for AR(2) serial correlation with strictly exogenous
regressors.] Then, we use $\hat{\rho}_1$ and $\hat{\rho}_2$ in place of $\rho_1$ and $\rho_2$ to obtain the transformed variables. This gives
one version of the FGLS estimator. If we have multiple explanatory variables, then each one is
transformed by $\tilde{x}_{tj} = x_{tj} - \hat{\rho}_1 x_{t-1,j} - \hat{\rho}_2 x_{t-2,j}$, when t > 2.

The treatment of the first two observations is a little tricky. It can be shown that the dependent
variable and each independent variable (including the intercept) should be transformed by


$$\tilde{z}_1 = \{(1 + \rho_2)[(1 - \rho_2)^2 - \rho_1^2]/(1 - \rho_2)\}^{1/2} z_1$$
$$\tilde{z}_2 = (1 - \rho_2^2)^{1/2} z_2 - [\rho_1 (1 - \rho_2^2)^{1/2}/(1 - \rho_2)] z_1,$$


where $z_1$ and $z_2$ denote either the dependent or an independent variable at t = 1 and t = 2,
respectively. We will not derive these transformations. Briefly, they eliminate the serial correlation between
the first two observations and make their error variances equal to $\sigma_e^2$.

Fortunately, econometrics packages geared toward time series analysis easily estimate models 
with general AR(q) errors; we rarely need to directly compute the transformed variables ourselves. 


12-4e What if the Serial Correlation Model Is Wrong? 


Even an AR(2) model is simple compared with the possible ways that the errors $\{u_t: t = 1, 2, \ldots, n\}$
can exhibit serial correlation, and one usually sees the simplest model, the AR(1) model, used in
practice. What happens if our chosen model is incorrect? Or, maybe the AR(1) model is correct, but there
is heteroskedasticity in the errors? Are the CO or PW estimates worthless if the errors do not follow
an AR(1) model?

Not at all. The calculations in Section 12-4c show that having the model of serial correlation
wrong does not cause inconsistent estimation of the $\beta_j$, provided the explanatory variables satisfy the
strict exogeneity assumption in equation (12.41) [along with contemporaneous exogeneity in (12.40)].
That is an important lesson to stress again: exogeneity of the explanatory variables is what matters
for consistency, not the serial correlation or variance properties of the errors. However, the usual OLS
inference in the quasi-differenced equation (12.38) [or its feasible version in (12.39)] will be incorrect
because then $\{e_t: t = 1, 2, \ldots\}$ will exhibit serial correlation. For example, if $\{u_t\}$ actually follows an
AR(2) but we use an AR(1) model, the errors in (12.38) will exhibit a complicated serial correlation
pattern. Fortunately, it is very easy to fix the usual FGLS inference: because (12.39) is estimated by
OLS, we can apply the Newey-West standard errors to this equation. In other words, obtain the
quasi-differenced variables, use them in an OLS regression, and apply Newey-West or some other HAC. As
a bonus, the standard errors are robust to arbitrary heteroskedasticity in $\{u_t\}$.

If we think the AR(1) model is correct, but are worried about heteroskedasticity in $\{e_t\}$, then we
can obtain the usual heteroskedasticity-robust standard error when estimating (12.39) by OLS. This
possibility is covered in more detail in Section 12-6d.

The idea of using HAC inference after we have used an FGLS procedure may seem odd: after 
all, the whole point of using something like CO or PW is to eliminate serial correlation. Nevertheless, 
it is a good idea, especially when using simple models, to not take them seriously when performing 
inference. It could very well be that using a PW correction (assuming strictly exogenous $\mathbf{x}_t$) is more
efficient than OLS even if the AR(1) model is not the correct model. Accounting for some of the serial
correlation through an AR(1) model might be considerably better than ignoring serial correlation in
estimation. But the careful researcher readily admits that the AR(1) structure could be incorrect, or that
there is possibly heteroskedasticity, and, therefore, conducts inference that is fully robust. Such strategies
are not built into standard econometrics software, but they are fairly easy to implement “by hand.”
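
To make the "by hand" strategy concrete, here is a minimal Python sketch for a simple regression with one explanatory variable. It estimates $\rho$ from the OLS residuals, quasi-differences the data (dropping the first observation, as in Cochrane-Orcutt), runs OLS on the transformed data, and then computes a Newey-West standard error for the slope. The function names, the simulated data, and the choice of truncation lag `g=4` are all assumptions made for illustration, and only one FGLS step is taken rather than iterating.

```python
import math
import random


def ols_simple(y, x):
    """OLS of y on an intercept and x; returns (b0, b1, residuals)."""
    n = len(y)
    xbar, ybar = sum(x) / n, sum(y) / n
    sst_x = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sst_x
    b0 = ybar - b1 * xbar
    resid = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]
    return b0, b1, resid


def newey_west_se_slope(x, resid, g):
    """Newey-West standard error of the slope with truncation lag g."""
    n = len(x)
    xbar = sum(x) / n
    sst_x = sum((xi - xbar) ** 2 for xi in x)
    a = [(xi - xbar) * ei for xi, ei in zip(x, resid)]
    v = sum(at ** 2 for at in a)
    for h in range(1, g + 1):
        weight = 1 - h / (g + 1)  # Bartlett weights decline linearly in h
        v += 2 * weight * sum(a[t] * a[t - h] for t in range(h, n))
    return math.sqrt(v) / sst_x


def fgls_ar1_with_hac(y, x, g):
    """One-step Cochrane-Orcutt-type FGLS with Newey-West inference."""
    _, _, u = ols_simple(y, x)
    # rho-hat: regress u_t on u_{t-1} without an intercept
    rho = sum(u[t] * u[t - 1] for t in range(1, len(u))) / sum(
        ut ** 2 for ut in u[:-1])
    y_qd = [y[t] - rho * y[t - 1] for t in range(1, len(y))]
    x_qd = [x[t] - rho * x[t - 1] for t in range(1, len(x))]
    # The intercept now estimates beta0 * (1 - rho); the slope estimates beta1
    _, b1, e = ols_simple(y_qd, x_qd)
    return rho, b1, newey_west_se_slope(x_qd, e, g)


# Simulated data (hypothetical): strictly exogenous x, AR(1) errors
random.seed(12)
n = 300
x, u = [], []
x_prev = u_prev = 0.0
for _ in range(n):
    x_prev = 0.5 * x_prev + random.gauss(0, 1)
    u_prev = 0.6 * u_prev + random.gauss(0, 0.5)
    x.append(x_prev)
    u.append(u_prev)
y = [1 + 2 * xi + ui for xi, ui in zip(x, u)]

rho_hat, b1_fgls, se_hac = fgls_ar1_with_hac(y, x, g=4)
```

With `g=0` the Newey-West formula collapses to the usual heteroskedasticity-robust standard error, so the same function covers the case where the AR(1) model is trusted but heteroskedasticity is a concern.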

We faced a similar situation in Section 8-4c when discussing the relative merits of OLS and 
weighted least squares. Assuming the conditional mean is correctly specified, it might be better to use 
WLS with an incorrect heteroskedasticity function than to use OLS because WLS might result in an 
efficiency gain. But we need to be objective when comparing the estimators by computing standard 
errors that allow for general heteroskedasticity, without assuming that the one we have chosen is the 
correct one. The same principle applies when using FGLS for serial correlation corrections. 


12-5 Differencing and Serial Correlation 


In Chapter 11, we presented differencing as a transformation for making an integrated process weakly 
dependent. There is another way to see the merits of differencing when dealing with highly persistent 
data. Suppose that we start with the simple regression model: 


$y_t = \beta_0 + \beta_1 x_t + u_t, \quad t = 1, 2, \ldots, n,$ [12.43]


where $u_t$ follows the AR(1) process in (12.32). As we mentioned in Section 11-3, and as we will
discuss more fully in Chapter 18, the usual OLS inference procedures can be very misleading when
the variables $y_t$ and $x_t$ are integrated of order one, or I(1). In the extreme case where the errors $\{u_t\}$ in
(12.43) follow a random walk, the equation makes no sense because, among other things, the variance
of $u_t$ grows with $t$. It is more logical to difference the equation:


$\Delta y_t = \beta_1 \Delta x_t + \Delta u_t, \quad t = 2, \ldots, n.$ [12.44]


If $u_t$ follows a random walk, then $e_t = \Delta u_t$ has zero mean and a constant variance and is serially
uncorrelated. Thus, assuming that $e_t$ and $\Delta x_t$ are uncorrelated, we can estimate (12.44) by OLS, where
we lose the first observation.

Even if $u_t$ does not follow a random walk, but $\rho$ is positive and large, first differencing is often a
good idea: it will eliminate most of the serial correlation. Of course, equation (12.44) is different from
(12.43), and one must keep that in mind in comparing the OLS estimates from the different equations.
As far as inference, just as we can apply Newey-West to the levels equation (12.43), we can do the same
after estimating (12.44) by OLS. We do not want to assume, necessarily, that there is no serial correlation
in (12.44). Allowing for multiple explanatory variables does not change anything.
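
The effect of differencing on serial correlation can be seen in a small simulation. The Python sketch below generates a regression whose error is a random walk, estimates the equation in levels and in first differences, and compares the AR(1) coefficient of the residuals in each case. The data-generating values, the seed, and the helper names are hypothetical, and an intercept is included in the differenced regression, which is harmless here.

```python
import random


def ols_simple(y, x):
    """OLS of y on an intercept and x; returns (b0, b1, residuals)."""
    n = len(y)
    xbar, ybar = sum(x) / n, sum(y) / n
    sst_x = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sst_x
    b0 = ybar - b1 * xbar
    resid = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]
    return b0, b1, resid


def ar1_coef(resid):
    """Slope from regressing resid_t on resid_{t-1} (no intercept)."""
    num = sum(resid[t] * resid[t - 1] for t in range(1, len(resid)))
    den = sum(r ** 2 for r in resid[:-1])
    return num / den


def diff(series):
    """First difference; the first observation is lost."""
    return [series[t] - series[t - 1] for t in range(1, len(series))]


random.seed(7)
n = 400
x, u = [], []
x_level = u_level = 0.0
for _ in range(n):
    x_level += random.gauss(0, 1)  # x_t is I(1)
    u_level += random.gauss(0, 1)  # u_t is a random walk (rho = 1)
    x.append(x_level)
    u.append(u_level)
y = [1 + 2 * xt + ut for xt, ut in zip(x, u)]

_, b1_levels, res_levels = ols_simple(y, x)
_, b1_diff, res_diff = ols_simple(diff(y), diff(x))
rho_levels = ar1_coef(res_levels)  # residuals in levels are highly persistent
rho_diff = ar1_coef(res_diff)      # residuals after differencing are not
```

In this setup the slope from the differenced equation should be close to the true value of 2, while the levels regression has residuals with an AR(1) coefficient near one.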


Differencing the Interest Rate Equation 


In Example 10.2, we estimated an equation relating the three-month T-bill rate to inflation and the
federal deficit [see equation (10.15)]. If we obtain the residuals from estimating (10.15) and
regress them on a single lag, we obtain $\hat{\rho} = .623$ (.110), which is large and very statistically significant.
Therefore, at a minimum, serial correlation is a problem in this equation.

If we difference the data and run the regression, we obtain 


$\Delta \mathit{i3}_t = .042 + .149\, \Delta \mathit{inf}_t - .181\, \Delta \mathit{def}_t + \hat{e}_t$
(.171) (.092) (.148) [12.45]
$n = 55, R^2 = .176, \bar{R}^2 = .145.$

The coefficients from this regression are very different from the equation in levels, suggesting either
that the explanatory variables are not strictly exogenous or that one or more of the variables has a unit
root. In fact, the correlation between $\mathit{i3}_t$ and $\mathit{i3}_{t-1}$ is about .885, which may indicate a problem with
interpreting (10.15) as a meaningful regression. Plus, the regression in differences has essentially no
serial correlation: a regression of $\hat{e}_t$ on $\hat{e}_{t-1}$ gives $\hat{\rho} = .072$ (.134). Because first differencing eliminates
possible unit roots as well as serial correlation, we probably have more faith in the estimates
and standard errors from (12.45) than (10.15). The equation in differences shows that annual changes
in interest rates are only weakly, positively related to annual changes in inflation, and the coefficient
on $\Delta \mathit{def}_t$ is actually negative (though not statistically significant at even the 20% significance level
against a two-sided alternative).


As we explained in Chapter 11, the decision of whether or not to difference is a tough one. But this
discussion points out another benefit of differencing, which is that it removes serial correlation. We
will return to this issue in Chapter 18.

GOING FURTHER 12.4
Suppose after estimating a model by OLS that you estimate $\rho$ from regression (12.20)
and you obtain $\hat{\rho} = .92$. What would you do about this?


12-6 Heteroskedasticity in Time Series Regressions 


We discussed testing and correcting for heteroskedasticity for cross-sectional applications in Chapter 8.
Heteroskedasticity can also occur in time series regression models, and the presence of heteroskedasticity,
while not causing bias or inconsistency in the $\hat{\beta}_j$, does invalidate the usual standard errors,
t statistics, and F statistics. This is just as in the cross-sectional case.

In time series regression applications, heteroskedasticity often receives little, if any, attention: 
the problem of serially correlated errors is usually more pressing. Nevertheless, it is useful to briefly 
cover some of the issues that arise in applying tests and corrections for heteroskedasticity in time 
series regressions. 

Because the usual OLS statistics are asymptotically valid under Assumptions TS.1’ through 
TS.5’, we are interested in what happens when the homoskedasticity assumption TS.4’ does not 
hold. Assumption TS.3’ rules out misspecifications such as omitted variables and certain kinds of
measurement error, while TS.5’ rules out serial correlation in the errors. It is important to remember
that serially correlated errors cause problems that adjustments for heteroskedasticity are not able to
address.


12-6a Heteroskedasticity-Robust Statistics 


In studying heteroskedasticity for cross-sectional regressions, we noted how it has no bearing on the 
unbiasedness or consistency of the OLS estimators. Exactly the same conclusions hold in the time 
series case, as we can see by reviewing the assumptions needed for unbiasedness (Theorem 10.1) and
consistency (Theorem 11.1). 

In Section 8-2, we discussed how the usual OLS standard errors, t statistics, and F statistics can 
be adjusted to allow for the presence of heteroskedasticity of unknown form. These same adjustments 
work for time series regressions under Assumptions TS.1’, TS.2’, TS.3’, and TS.5’. Thus, provided the 
only assumption violated is the homoskedasticity assumption, valid inference is easily obtained in 
most econometric packages. 


12-6b Testing for Heteroskedasticity 


Sometimes, we wish to test for heteroskedasticity in time series regressions, especially if we are con- 
cerned about the performance of heteroskedasticity-robust statistics in relatively small sample sizes. 
The tests we covered in Chapter 8 can be applied directly, but with a few caveats. First, the errors $u_t$
should not be serially correlated; any serial correlation will generally invalidate a test for heteroske- 
dasticity. Thus, it makes sense to test for serial correlation first, using a heteroskedasticity-robust test 
if heteroskedasticity is suspected. Then, after something has been done to correct for serial correla- 
tion, we can test for heteroskedasticity. 
Second, consider the equation used to motivate the Breusch-Pagan test for heteroskedasticity: 


$u_t^2 = \delta_0 + \delta_1 x_{t1} + \cdots + \delta_k x_{tk} + v_t,$ [12.46]


where the null hypothesis is $H_0: \delta_1 = \delta_2 = \cdots = \delta_k = 0$. For the F statistic (with $\hat{u}_t^2$ replacing $u_t^2$
as the dependent variable) to be valid, we must assume that the errors $\{v_t\}$ are themselves homoskedastic
(as in the cross-sectional case) and serially uncorrelated. These are implicitly assumed in
computing all standard tests for heteroskedasticity, including the version of the White test we covered
in Section 8-3. Assuming that the $\{v_t\}$ are serially uncorrelated rules out certain forms of dynamic
heteroskedasticity, something we will treat in the next subsection.

If heteroskedasticity is found in the $u_t$ (and the $u_t$ are not serially correlated), then the
heteroskedasticity-robust test statistics can be used. An alternative is to use weighted least squares,
as in Section 8-4. The mechanics of weighted least squares for the time series case are identical to
those for the cross-sectional case.
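
The Breusch-Pagan regression in (12.46) is easy to carry out directly. The following Python sketch does so for a single explanatory variable: it regresses the squared OLS residuals on the regressor and reports the slope, the LM statistic ($n$ times the R-squared), and the F statistic. The simulated data and function names are hypothetical, and the errors are generated without serial correlation, in line with the caveat discussed above.

```python
import math
import random


def ols_simple(y, x):
    """OLS of y on an intercept and x; returns (b0, b1, residuals, R-squared)."""
    n = len(y)
    xbar, ybar = sum(x) / n, sum(y) / n
    sst_x = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sst_x
    b0 = ybar - b1 * xbar
    resid = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]
    sst_y = sum((yi - ybar) ** 2 for yi in y)
    r2 = 1 - sum(e ** 2 for e in resid) / sst_y
    return b0, b1, resid, r2


def breusch_pagan(y, x):
    """Slope, LM, and F statistics from regressing squared residuals on x."""
    n = len(y)
    k = 1  # number of regressors in the variance equation
    _, _, u, _ = ols_simple(y, x)
    u2 = [ut ** 2 for ut in u]
    _, d1, _, r2 = ols_simple(u2, x)
    lm = n * r2
    f = (r2 / k) / ((1 - r2) / (n - k - 1))
    return d1, lm, f


# Simulated data (hypothetical): the error variance rises with x
random.seed(5)
n = 1000
x = [random.uniform(0, 4) for _ in range(n)]
y = [1 + 0.5 * xt + math.sqrt(0.5 + xt) * random.gauss(0, 1) for xt in x]

slope, lm, f_stat = breusch_pagan(y, x)
```

Under the null of homoskedasticity, the LM statistic is compared with a chi-square distribution with one degree of freedom here, whose 5% critical value is 3.84.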


EXAMPLE 12.8 Heteroskedasticity and the Efficient Markets Hypothesis 


In Example 11.4, we estimated the simple model

$\mathit{return}_t = \beta_0 + \beta_1 \mathit{return}_{t-1} + u_t.$ [12.47]

The EMH states that $\beta_1 = 0$. When we tested this hypothesis using the data in NYSE, we obtained
$t_{\hat{\beta}_1} = 1.55$ with $n = 689$. With such a large sample, this is not much evidence against the EMH. Although
the EMH states that the expected return given past observable information should be constant, it says
nothing about the conditional variance. In fact, the Breusch-Pagan test for heteroskedasticity entails
regressing the squared OLS residuals $\hat{u}_t^2$ on $\mathit{return}_{t-1}$:

$\hat{u}_t^2 = 4.66 - 1.104\, \mathit{return}_{t-1} + \mathit{residual}_t$
(0.43) (0.201) [12.48]
$n = 689, R^2 = .042.$


The t statistic on $\mathit{return}_{t-1}$ is about $-5.5$, indicating strong evidence of heteroskedasticity. Because the
coefficient on $\mathit{return}_{t-1}$ is negative, we have the interesting finding that volatility in stock returns is
lower when the previous return was high, and vice versa. Therefore, we have found what is common
in many financial studies: the expected value of stock returns does not depend on past returns, but the
variance of returns does.

GOING FURTHER 12.5
How would you compute the White test for heteroskedasticity in equation (12.47)?


12-6c Autoregressive Conditional Heteroskedasticity 


In recent years, economists have become interested in dynamic forms of heteroskedasticity. Of course,
if $\mathbf{x}_t$ contains a lagged dependent variable, then heteroskedasticity as in (12.46) is dynamic. But dynamic
forms of heteroskedasticity can appear even in models with no dynamics in the regression equation.
To see this, consider a simple static regression model:


$y_t = \beta_0 + \beta_1 z_t + u_t,$


and assume that the Gauss-Markov assumptions hold. This means that the OLS estimators are BLUE.
The homoskedasticity assumption says that $\mathrm{Var}(u_t | \mathbf{Z})$ is constant, where $\mathbf{Z}$ denotes all $n$ outcomes
of $z_t$. Even if the variance of $u_t$ given $\mathbf{Z}$ is constant, there are other ways that heteroskedasticity can
arise. Engle (1982) suggested looking at the conditional variance of $u_t$ given past errors (where the
conditioning on $\mathbf{Z}$ is left implicit). Engle suggested what is known as the autoregressive conditional
heteroskedasticity (ARCH) model. The first-order ARCH model is


$E(u_t^2 | u_{t-1}, u_{t-2}, \ldots) = E(u_t^2 | u_{t-1}) = \alpha_0 + \alpha_1 u_{t-1}^2,$ [12.49]


where we leave the conditioning on $\mathbf{Z}$ implicit. This equation represents the conditional variance of $u_t$
given past $u_t$ only if $E(u_t | u_{t-1}, u_{t-2}, \ldots) = 0$, which means that the errors are serially uncorrelated.
Because conditional variances must be positive, this model only makes sense if $\alpha_0 > 0$ and $\alpha_1 \geq 0$; if
$\alpha_1 = 0$, there are no dynamics in the variance equation.

It is instructive to write (12.49) as 


$u_t^2 = \alpha_0 + \alpha_1 u_{t-1}^2 + v_t,$ [12.50]


where the expected value of $v_t$ (given $u_{t-1}, u_{t-2}, \ldots$) is zero by definition. (However, the $v_t$ are not
independent of past $u_t$ because of the constraint $v_t \geq -\alpha_0 - \alpha_1 u_{t-1}^2$.) Equation (12.50) looks like
an autoregressive model in $u_t^2$ (hence the name ARCH). The stability condition for this equation is
$\alpha_1 < 1$, just as in the usual AR(1) model. When $\alpha_1 > 0$, the squared errors contain (positive) serial
correlation even though the $u_t$ themselves do not.
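
A short simulation can make the last point tangible. The Python sketch below generates errors from a first-order ARCH process, using conditionally normal draws, and computes the first-order sample autocorrelation of $u_t$ and of $u_t^2$. The parameter values ($\alpha_0 = 1$, $\alpha_1 = .3$), the sample size, the seed, and the function names are assumptions chosen for illustration; conditional normality is not required by (12.49).

```python
import math
import random


def autocorr1(series):
    """First-order sample autocorrelation."""
    n = len(series)
    mean = sum(series) / n
    dev = [s - mean for s in series]
    num = sum(dev[t] * dev[t - 1] for t in range(1, n))
    den = sum(d ** 2 for d in dev)
    return num / den


def simulate_arch1(n, alpha0, alpha1, seed=0):
    """Draw u_1, ..., u_n with Var(u_t | u_{t-1}) = alpha0 + alpha1 * u_{t-1}^2."""
    rng = random.Random(seed)
    u = []
    u_prev = 0.0
    for _ in range(n):
        h = alpha0 + alpha1 * u_prev ** 2  # conditional variance as in (12.49)
        u_prev = math.sqrt(h) * rng.gauss(0, 1)
        u.append(u_prev)
    return u


u = simulate_arch1(n=5000, alpha0=1.0, alpha1=0.3, seed=42)
rho_u = autocorr1(u)                          # should be near zero
rho_u2 = autocorr1([ut ** 2 for ut in u])     # should be clearly positive
sample_var = sum(ut ** 2 for ut in u) / len(u)
```

With a stable process, the unconditional variance is $\alpha_0/(1 - \alpha_1)$, so `sample_var` should be in the neighborhood of $1/.7 \approx 1.43$ in this design.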

What implications does (12.50) have for OLS? Because we began by assuming the Gauss- 
Markov assumptions hold, OLS is BLUE. Further, even if $u_t$ is not normally distributed, we know that
the usual OLS test statistics are asymptotically valid under Assumptions TS.1’ through TS.5’, which 
are satisfied by static and distributed lag models with ARCH errors. 

If OLS still has desirable properties under ARCH, why should we care about ARCH forms of
heteroskedasticity in static and distributed lag models? We should be concerned for two reasons. First,
it is possible to get consistent (but not unbiased) estimators of the $\beta_j$ that are asymptotically more
efficient than the OLS estimators. A weighted least squares procedure, based on estimating (12.50),
will do the trick. A maximum likelihood procedure also works under the assumption that the errors $u_t$
have a conditional normal distribution. Second, economists in various fields have become interested
in dynamics in the conditional variance. Engle’s original application was to the variance of U.K.
inflation, where he found that a larger magnitude of the error in the previous time period (larger $u_{t-1}^2$)
was associated with a larger error variance in the current period. Because variance is often used
to measure volatility, and volatility is a key element in asset pricing theories, ARCH models have
become important in empirical finance.

ARCH models also apply when there are dynamics in the conditional mean. Suppose we have the
dependent variable, $y_t$, a contemporaneous exogenous variable, $z_t$, and


$E(y_t | z_t, y_{t-1}, z_{t-1}, y_{t-2}, \ldots) = \beta_0 + \beta_1 z_t + \beta_2 y_{t-1} + \beta_3 z_{t-1},$


so that at most one lag of $y$ and $z$ appears in the dynamic regression. The typical approach is to assume
that $\mathrm{Var}(y_t | z_t, y_{t-1}, z_{t-1}, y_{t-2}, \ldots)$ is constant, as we discussed in Chapter 11. But this variance could
follow an ARCH model:


$\mathrm{Var}(y_t | z_t, y_{t-1}, z_{t-1}, y_{t-2}, \ldots) = \mathrm{Var}(u_t | z_t, y_{t-1}, z_{t-1}, y_{t-2}, \ldots)$
$= \alpha_0 + \alpha_1 u_{t-1}^2,$


where $u_t = y_t - E(y_t | z_t, y_{t-1}, z_{t-1}, y_{t-2}, \ldots)$. As we know from Chapter 11, the presence of ARCH
does not affect consistency of OLS, and the usual heteroskedasticity-robust standard errors and test
statistics are valid. (Remember, these are valid for any form of heteroskedasticity, and ARCH is just
one particular form of heteroskedasticity.)

If you are interested in the ARCH model and its extensions, see Bollerslev, Chou, and Kroner
(1992) and Bollerslev, Engle, and Nelson (1994) for surveys, both now more than 24 years old.


EXAMPLE 12.9 ARCH in Stock Returns


In Example 12.8, we saw that there was heteroskedasticity in weekly stock returns. This heteroske- 
dasticity is actually better characterized by the ARCH model in (12.50). If we compute the OLS 
residuals from (12.47), square these, and regress them on the lagged squared residual, we obtain 


$\hat{u}_t^2 = 2.95 + .337\, \hat{u}_{t-1}^2 + \mathit{residual}_t$
(.44) (.036) [12.51]
$n = 688, R^2 = .114.$


The t statistic on $\hat{u}_{t-1}^2$ is over nine, indicating strong ARCH. As we discussed earlier, a larger error at
time $t - 1$ implies a larger variance in stock returns today.

It is important to see that, though the squared OLS residuals are autocorrelated, the OLS residuals
themselves are not (as is consistent with the EMH). Regressing $\hat{u}_t$ on $\hat{u}_{t-1}$ gives $\hat{\rho} = .0014$ with
$t_{\hat{\rho}} = .038$.


12-6d Heteroskedasticity and Serial Correlation in Regression Models 


Nothing rules out the possibility of both heteroskedasticity and serial correlation being present in a 
regression model. If we are unsure, we can always use OLS and compute fully robust standard errors, 
as described in Section 12-5. 

Much of the time serial correlation is viewed as the most important problem, because it usually 
has a larger impact on standard errors and the efficiency of estimators than does heteroskedasticity. As 
we concluded in Section 12-2, obtaining tests for serial correlation that are robust to arbitrary
heteroskedasticity is fairly straightforward. If we detect serial correlation using such a test, we can employ
the Cochrane-Orcutt (or Prais-Winsten) transformation [see equation (12.38)] and, in the transformed
equation, use heteroskedasticity-robust standard errors and test statistics. Or, we can even test for
heteroskedasticity in (12.38) using the Breusch-Pagan or White tests.

Alternatively, we can model heteroskedasticity and serial correlation and correct for both through 
a combined weighted least squares AR(1) procedure. Specifically, consider the model 


$y_t = \beta_0 + \beta_1 x_{t1} + \cdots + \beta_k x_{tk} + u_t$
$u_t = \sqrt{h_t}\, v_t$ [12.52]
$v_t = \rho v_{t-1} + e_t, \quad |\rho| < 1,$


where the explanatory variables $\mathbf{X}$ are independent of $e_t$ for all $t$, and $h_t$ is a function of the $x_{tj}$. The
process $\{e_t\}$ has zero mean and constant variance $\sigma_e^2$ and is serially uncorrelated. Therefore, $\{v_t\}$ satisfies
a stable AR(1) process. The error $u_t$ is heteroskedastic, in addition to containing serial correlation:

$\mathrm{Var}(u_t | \mathbf{x}_t) = \sigma_v^2 h_t,$

where $\sigma_v^2 = \sigma_e^2/(1 - \rho^2)$. But $v_t = u_t/\sqrt{h_t}$ is homoskedastic and follows a stable AR(1) model.
Therefore, the transformed equation 


$y_t/\sqrt{h_t} = \beta_0 (1/\sqrt{h_t}) + \beta_1 (x_{t1}/\sqrt{h_t}) + \cdots + \beta_k (x_{tk}/\sqrt{h_t}) + v_t$ [12.53]


has AR(1) errors. Now, if we have a particular kind of heteroskedasticity in mind (that is, we know
$h_t$), we can estimate (12.53) using standard CO or PW methods.

In most cases, we have to estimate $h_t$ first. The following method combines the weighted least
squares method from Section 8-4 with the AR(1) serial correlation correction from Section 12-3.


Feasible GLS with Heteroskedasticity and AR(1) Serial Correlation: 


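(i) Estimate (12.52) by OLS and save the residuals, $\hat{u}_t$.

(ii) Regress $\log(\hat{u}_t^2)$ on $x_{t1}, \ldots, x_{tk}$ (or on $\hat{y}_t$, $\hat{y}_t^2$) and obtain the fitted values, say, $\hat{g}_t$.

(iii) Obtain the estimates of $h_t$: $\hat{h}_t = \exp(\hat{g}_t)$.

(iv) Estimate the transformed equation


$\hat{h}_t^{-1/2} y_t = \hat{h}_t^{-1/2} \beta_0 + \beta_1 \hat{h}_t^{-1/2} x_{t1} + \cdots + \beta_k \hat{h}_t^{-1/2} x_{tk} + \mathit{error}_t$ [12.54]


by standard Cochrane-Orcutt or Prais-Winsten methods.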
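
The four steps above can be sketched in Python for the case of a single explanatory variable. This is only an illustration under stated assumptions: the helper `ols2`, the function `fgls_het_ar1`, and the simulated data are hypothetical, the variance function is modeled as $\exp(\delta_0 + \delta_1 x_t)$, and step (iv) uses a single Cochrane-Orcutt step rather than iterating.

```python
import math
import random


def ols2(y, w1, w2):
    """OLS of y on two regressors w1 and w2, with no added intercept.

    Passing a column of ones as w1 gives a regression with an intercept.
    Returns (coef on w1, coef on w2, residuals).
    """
    s11 = sum(a * a for a in w1)
    s12 = sum(a * b for a, b in zip(w1, w2))
    s22 = sum(b * b for b in w2)
    s1y = sum(a * c for a, c in zip(w1, y))
    s2y = sum(b * c for b, c in zip(w2, y))
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    resid = [c - b1 * a - b2 * b for a, b, c in zip(w1, w2, y)]
    return b1, b2, resid


def fgls_het_ar1(y, x):
    """Weighted least squares combined with a one-step AR(1) correction."""
    n = len(y)
    ones = [1.0] * n
    # (i) OLS in levels; save the residuals
    _, _, u = ols2(y, ones, x)
    # (ii) regress log(u-hat squared) on x to get fitted values g-hat
    d0, d1, _ = ols2([math.log(ut ** 2) for ut in u], ones, x)
    # (iii) h-hat = exp(g-hat); store the weights h-hat^(-1/2)
    w = [math.exp(-0.5 * (d0 + d1 * xt)) for xt in x]
    # (iv) weight every variable, including the intercept column
    ys = [wt * yt for wt, yt in zip(w, y)]
    xs = [wt * xt for wt, xt in zip(w, x)]
    _, _, v = ols2(ys, w, xs)
    rho = sum(v[t] * v[t - 1] for t in range(1, n)) / sum(
        vt ** 2 for vt in v[:-1])

    def quasi_diff(s):
        return [s[t] - rho * s[t - 1] for t in range(1, n)]

    b0, b1, _ = ols2(quasi_diff(ys), quasi_diff(w), quasi_diff(xs))
    return b0, b1, rho


# Simulated data (hypothetical): h_t = exp(0.5 * x_t) and rho = 0.5
random.seed(3)
n = 500
x, y = [], []
v_prev = 0.0
for _ in range(n):
    xt = random.uniform(0, 2)
    v_prev = 0.5 * v_prev + random.gauss(0, 0.5)
    ut = math.sqrt(math.exp(0.5 * xt)) * v_prev
    x.append(xt)
    y.append(1 + 2 * xt + ut)

b0_hat, b1_hat, rho_hat = fgls_het_ar1(y, x)
```

Because the weighted intercept column is quasi-differenced along with the other variables, its coefficient estimates $\beta_0$ directly rather than $\beta_0(1 - \rho)$.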

The FGLS estimators obtained from the procedure are asymptotically efficient provided the 
assumptions in model (12.52) hold. More importantly, all standard errors and test statistics from the 
CO or PW estimation are asymptotically valid. If we allow the variance function to be misspecified, 
or allow the possibility that any serial correlation does not follow an AR(1) model, then we can apply 
quasi-differencing to (12.54), estimating the resulting equation by OLS, and then obtain the Newey- 
West standard errors. By doing so, we would be using a procedure that could be asymptotically effi- 
cient while ensuring that our inference is valid (asymptotically) if we have misspecified our model of 
either heteroskedasticity or serial correlation. 


Summary 


We have covered the important problem of serial correlation in the errors of multiple regression models. 
Positive correlation between adjacent errors is common, especially in static and finite distributed lag models. 
Serial correlation in the errors generally causes the usual OLS standard errors and statistics to be misleading
(although the $\hat{\beta}_j$ can still be unbiased, or at least consistent). Typically, the OLS standard errors
underestimate the true uncertainty in the parameter estimates.

Econometrics and statistics packages now routinely compute standard errors and test statistics that are
robust to general serial correlation; as a bonus, they are also robust to heteroskedasticity of unknown form.
As we discussed in Sections 12-2 and 12-4c, OLS is consistent under weaker assumptions than GLS solutions
to serial correlation. Therefore, it has become common to rely on OLS and to compute so-called HAC
standard errors, the most popular being Newey-West standard errors.

If the Newey-West standard errors after OLS are acceptably small, we may not even care whether 
serial correlation is present. Nevertheless, as we discussed in Section 12-3, there are good reasons to know 
whether the errors are strongly autocorrelated. Obtaining tests under strict exogeneity, and even relaxing 
the strict exogeneity assumption, is easy. It simply requires regressing OLS residuals on lagged residuals 
(and maybe the explanatory variables). 

In models with strictly exogenous regressors, we can use a feasible GLS procedure—Cochrane-Orcutt 
or Prais-Winsten—to correct for AR(1) serial correlation. This results in estimates that are different from 
the OLS estimates: the FGLS estimates are obtained from OLS on quasi-differenced variables. Assuming 
that the AR(1) model is correct, and that the errors are homoskedastic, all of the usual test statistics from 
the transformed equation are asymptotically valid. Almost all regression packages have built-in features for 
estimating models with AR(1) errors. 

Finally, we discussed some special features of heteroskedasticity in time series models. As in the 
cross-sectional case, the most important kind of heteroskedasticity is that which depends on the explana- 
tory variables; this is what determines whether the usual OLS statistics are valid. The Breusch-Pagan and 
White tests covered in Chapter 8 can be applied directly, with the caveat that the errors should not be 
serially correlated. In recent years, economists—especially those who study the financial markets—have 
become interested in dynamic forms of heteroskedasticity. The ARCH model is the leading example. 


Key Terms 


AR(1) Serial Correlation
Autoregressive Conditional Heteroskedasticity (ARCH)
Breusch-Godfrey Test
Cochrane-Orcutt (CO) Estimation
Durbin-Watson (DW) Statistic
Feasible GLS (FGLS)
Heteroskedasticity and Autocorrelation Consistent (HAC) Standard Errors
Newey-West Standard Errors
Prais-Winsten (PW) Estimation
Quasi-Differenced Data
Serial Correlation-Robust Standard Error
Truncation Lag
Weighted Least Squares


Problems 


1 When the errors in a regression model have AR(1) serial correlation, why do the OLS standard errors
tend to underestimate the sampling variation in the $\hat{\beta}_j$? Is it always true that the OLS standard errors
are too small?


2 Explain what is wrong with the following statement: “The Cochrane-Orcutt and Prais-Winsten methods 
are both used to obtain valid standard errors for the OLS estimates when there is a serial correlation.” 


3 In Example 10.6, we used the data in FAIR to estimate a variant on Fair’s model for predicting 
presidential election outcomes in the United States. 
(i) What argument can be made for the error term in this equation being serially uncorrelated?
(Hint: How often do presidential elections take place?)
(ii) When the OLS residuals from (10.23) are regressed on the lagged residuals, we obtain
$\hat{\rho} = -.068$ and $\mathrm{se}(\hat{\rho}) = .240$. What do you conclude about serial correlation in the $u_t$?
(iii) Does the small sample size in this application worry you in testing for serial correlation?


4 True or false: “If the errors in a regression model contain ARCH, they must be serially correlated.”


5 (i) In the enterprise zone event study in Computer Exercise C5 in Chapter 10, a regression of the
OLS residuals on the lagged residuals produces $\hat{\rho} = .841$ and $\mathrm{se}(\hat{\rho}) = .053$. What implications
does this have for OLS?

(ii) If you want to use OLS but also want to obtain a valid standard error for the EZ coefficient,
what would you do?


6 In Example 12.8, we found evidence of heteroskedasticity in $u_t$ in equation (12.47). Thus, we compute
the heteroskedasticity-robust standard errors (in [·]) along with the usual standard errors:


$\widehat{\mathit{return}}_t = .180 + .059\, \mathit{return}_{t-1}$
(.081) (.038)
[.085] [.069]
$n = 689, R^2 = .0035, \bar{R}^2 = .0020.$


What does using the heteroskedasticity-robust t statistic do to the significance of $\mathit{return}_{t-1}$?


7 Consider a standard multiple linear regression model with time series data: 


$y_t = \beta_0 + \beta_1 x_{t1} + \cdots + \beta_k x_{tk} + u_t.$


Assume that Assumptions TS.1, TS.2, TS.3, and TS.4 all hold. 

(i) Suppose we think that the errors $\{u_t\}$ follow an AR(1) model with parameter $\rho$ and so we apply
the Prais-Winsten method. If the errors do not follow an AR(1) model (for example, suppose
they follow an AR(2) model, or an MA(1) model), why will the usual Prais-Winsten standard
errors be incorrect?

(ii) Can you think of a way to use the Newey-West procedure, in conjunction with Prais-Winsten
estimation, to obtain valid standard errors? Be very specific about the steps you would follow.
[Hint: It may help to study equation (12.32) and note that, if $\{u_t\}$ does not follow an AR(1)
process, $e_t$ generally should be replaced by $u_t - \rho u_{t-1}$, where $\rho$ is the probability limit of the
estimator $\hat{\rho}$. Now, is the error $\{u_t - \rho u_{t-1}\}$ serially uncorrelated in general? What can you do if
it is not?]

(iii) Explain why your answer to part (ii) should not change if we drop Assumption TS.4.


8 Suppose in a static or distributed lag time series regression, you are able to use n = 280 quarterly 
observations. What would be some reasonable values for the lag g in the Newey-West estimator? 


Computer Exercises 


C1 In Example 11.6, we estimated a finite DL model in first differences (changes): 


$\mathit{cgfr}_t = \gamma_0 + \delta_0 \mathit{cpe}_t + \delta_1 \mathit{cpe}_{t-1} + \delta_2 \mathit{cpe}_{t-2} + u_t.$

Use the data in FERTIL3 to test whether there is AR(1) serial correlation in the errors.


C2 (i) Using the data in WAGEPRC, estimate the distributed lag model from Problem 5 in Chapter 11. 
Use regression (12.20) to test for AR(1) serial correlation. 
(ii) Reestimate the model using iterated Cochrane-Orcutt estimation. What is your new estimate of 
the long-run propensity? 
(iii) Using iterated CO, find the standard error for the LRP. (This requires you to estimate a modified 
equation.) Determine whether the estimated LRP is statistically different from one at the 5% level. 


C3 (i) In part (i) of Computer Exercise C6 in Chapter 11, you were asked to estimate the accelerator
model for inventory investment. Test this equation for AR(1) serial correlation.
(ii) If you find evidence of serial correlation, reestimate the equation by Cochrane-Orcutt and
compare the results.


C4 (i) Use NYSE to estimate equation (12.48). Let $\hat{h}_t$ be the fitted values from this equation (the
estimates of the conditional variance). How many $\hat{h}_t$ are negative?
(ii) Add $\mathit{return}_{t-1}^2$ to (12.48) and again compute the fitted values, $\hat{h}_t$. Are any $\hat{h}_t$ negative?
(iii) Use the $\hat{h}_t$ from part (ii) to estimate (12.47) by weighted least squares (as in Section 8-4).
Compare your estimate of $\beta_1$ with that in equation (11.16). Test $H_0: \beta_1 = 0$ and compare the
outcome when OLS is used.
(iv) Now, estimate (12.47) by WLS, using the estimated ARCH model in (12.51) to obtain the $\hat{h}_t$.
Does this change your findings from part (iii)?


C5 Consider the version of Fair’s model in Example 10.6. Now, rather than predicting the proportion of
the two-party vote received by the Democrat, estimate a linear probability model for whether or not the
Democrat wins.
(i) Use the binary variable demwins in place of demvote in (10.23) and report the results in
standard form. Which factors affect the probability of winning? Use the data only through 1992.
(ii) How many fitted values are less than zero? How many are greater than one?
(iii) Use the following prediction rule: if $\widehat{\mathit{demwins}} > .5$, you predict the Democrat wins; otherwise,
the Republican wins. Using this rule, determine how many of the 20 elections are correctly
predicted by the model.
(iv) Plug in the values of the explanatory variables for 1996. What is the predicted probability that
Clinton would win the election? Clinton did win; did you get the correct prediction?
(v) Use a heteroskedasticity-robust t test for AR(1) serial correlation in the errors. What do you
find?
(vi) Obtain the heteroskedasticity-robust standard errors for the estimates in part (i). Are there
notable changes in any t statistics?


C6 (i) In Computer Exercise C7 in Chapter 10, you estimated a simple relationship between consumption
growth and growth in disposable income. Test the equation for AR(1) serial correlation
(using CONSUMP).
(ii) In Computer Exercise C7 in Chapter 11, you tested the permanent income hypothesis by
regressing the growth in consumption on one lag. After running this regression, test for
heteroskedasticity by regressing the squared residuals on $\mathit{gc}_{t-1}$ and $\mathit{gc}_{t-1}^2$. What do you conclude?


C7 (i) For Example 12.4, using the data in BARIUM, obtain the iterative Cochrane-Orcutt estimates.
(ii) Are the Prais-Winsten and Cochrane-Orcutt estimates similar? Did you expect them to be?


C8 Use the data in TRAFFIC2 for this exercise.
(i) Run an OLS regression of prcfat on a linear time trend, monthly dummy variables, and the
variables wkends, unem, spdlaw, and beltlaw. Test the errors for AR(1) serial correlation
using the regression in equation (12.20). Does it make sense to use the test that assumes strict
exogeneity of the regressors?
(ii) Obtain serial correlation- and heteroskedasticity-robust standard errors for the coefficients
on spdlaw and beltlaw, using four lags in the Newey-West estimator. How does this affect the
statistical significance of the two policy variables?
(iii) Now, estimate the model using iterative Prais-Winsten and compare the estimates with the OLS
estimates. Are there important changes in the policy variable coefficients or their statistical
significance?


C9 The file FISH contains 97 daily price and quantity observations on fish prices at the Fulton Fish Market
in New York City. Use the variable log(avgprc) as the dependent variable.
(i) Regress log(avgprc) on four daily dummy variables, with Friday as the base. Include a linear
time trend. Is there evidence that price varies systematically within a week?
(ii) Now, add the variables wave2 and wave3, which are measures of wave heights over the past
several days. Are these variables individually significant? Describe a mechanism by which
stormy seas would increase the price of fish.
(iii) What happened to the time trend when wave2 and wave3 were added to the regression? What
must be going on?
(iv) Explain why all explanatory variables in the regression are safely assumed to be strictly
exogenous.
(v) Test the errors for AR(1) serial correlation.
(vi) Obtain the Newey-West standard errors using four lags. What happens to the t statistics on
wave2 and wave3? Did you expect a bigger or smaller change compared with the usual OLS
t statistics?
(vii) Now, obtain the Prais-Winsten estimates for the model estimated in part (ii). Are wave2 and
wave3 jointly statistically significant?


C10 Use the data in PHILLIPS to answer these questions. 

(i) Using the entire data set, estimate the static Phillips curve equation $inf_t = \beta_0 + \beta_1 unem_t + u_t$
by OLS and report the results in the usual form.

(ii) Obtain the OLS residuals from part (i), $\hat{u}_t$, and obtain $\hat{\rho}$ from the regression $\hat{u}_t$ on $\hat{u}_{t-1}$. (It is fine
to include an intercept in this regression.) Is there strong evidence of serial correlation?

(iii) Now estimate the static Phillips curve model by iterative Prais-Winsten. Compare the estimate
of $\beta_1$ with that obtained in Table 12.2. Is there much difference in the estimate when the later
years are added?

(iv) Rather than using Prais-Winsten, use iterative Cochrane-Orcutt. How similar are the final
estimates of $\rho$? How similar are the PW and CO estimates of $\beta_1$?


C11 Use the data in NYSE to answer these questions. 
(i) Estimate the model in equation (12.47) and obtain the squared OLS residuals. Find the average,
minimum, and maximum values of $\hat{u}_t^2$ over the sample.

(ii) Use the squared OLS residuals to estimate the following model of heteroskedasticity:


$\mathrm{Var}(u_t \mid return_{t-1}, return_{t-2}, \ldots) = \mathrm{Var}(u_t \mid return_{t-1}) = \delta_0 + \delta_1 return_{t-1} + \delta_2 return_{t-1}^2.$


Report the estimated coefficients, the reported standard errors, the R-squared, and the adjusted
R-squared.

(iii) Sketch the conditional variance as a function of the lagged $return_{t-1}$. For what value of $return_{t-1}$
is the variance the smallest, and what is the variance?

(iv) For predicting the dynamic variance, does the model in part (ii) produce any negative variance
estimates?

(v) Does the model in part (ii) seem to fit better or worse than the ARCH(1) model in
Example 12.9? Explain.

(vi) To the ARCH(1) regression in equation (12.51), add the second lag, $\hat{u}_{t-2}^2$. Does this lag seem
important? Does the ARCH(2) model fit better than the model in part (ii)?


C12 Use the data in INVEN for this exercise; see also Computer Exercise C6 in Chapter 11. 
(i) Obtain the OLS residuals from the accelerator model $\Delta inven_t = \beta_0 + \beta_1 \Delta GDP_t + u_t$ and use
the regression $\hat{u}_t$ on $\hat{u}_{t-1}$ to test for serial correlation. What is the estimate of $\rho$? How big a
problem does serial correlation seem to be?
(ii) Estimate the accelerator model by PW, and compare the estimate of $\beta_1$ to the OLS estimate.
Why do you expect them to be similar?


C13 Use the data in OKUN to answer this question; see also Computer Exercise C11 in Chapter 11. 
(i) Estimate the equation $pcrgdp_t = \beta_0 + \beta_1 cunem_t + u_t$ and test the errors for AR(1) serial
correlation, without assuming $\{cunem_t: t = 1, 2, \ldots\}$ is strictly exogenous. What do you
conclude?

(ii) Regress the squared residuals, $\hat{u}_t^2$, on $cunem_t$ (this is the Breusch-Pagan test for
heteroskedasticity in the simple regression case). What do you conclude?

(iii) Obtain the heteroskedasticity-robust standard error for the OLS estimate $\hat{\beta}_1$. Is it substantially
different from the usual OLS standard error?


C14 Use the data in MINWAGE for this exercise, focusing on sector 232.

(i) Estimate the equation

$gwage232_t = \beta_0 + \beta_1 gmwage_t + \beta_2 gcpi_t + u_t$

and test the errors for AR(1) serial correlation. Does it matter whether you assume $gmwage_t$ and
$gcpi_t$ are strictly exogenous? What do you conclude overall?

(ii) Obtain the Newey-West standard error for the OLS estimates in part (i), using a lag of 12. How
do the Newey-West standard errors compare to the usual OLS standard errors?

(iii) Now obtain the heteroskedasticity-robust standard errors for OLS, and compare them with the
usual standard errors and the Newey-West standard errors. Does it appear that serial correlation
or heteroskedasticity is more of a problem in this application?

(iv) Use the Breusch-Pagan test in the original equation to verify that the errors exhibit strong
heteroskedasticity.

(v) Add lags 1 through 12 of gmwage to the equation in part (i). Obtain the p-value for the joint F
test for lags 1 through 12, and compare it with the p-value for the heteroskedasticity-robust test.
How does adjusting for heteroskedasticity affect the significance of the lags?

(vi) Obtain the p-value for the joint significance test in part (v) using the Newey-West approach.
What do you conclude now?

(vii) If you leave out the lags of gmwage, is the estimate of the long-run propensity much different?


C15 Use the data in BARIUM to answer this question.

(i) In Table 12.1 the reported standard errors for OLS are uniformly below those of the
corresponding standard errors for GLS (Prais-Winsten). Explain why comparing the OLS and
GLS standard errors is flawed.

(ii) Reestimate the equation represented by the column labeled “OLS” in Table 12.1 by OLS, but
now find the Newey-West standard errors using a window g = 4 (four months). How does the
Newey-West standard error on lchempi compare to the usual OLS standard error? How does it
compare to the PW standard error? Make the same comparisons for the afdec6 variable.

(iii) Redo part (ii) now using a window g = 12. What happens to the standard errors on lchempi and
afdec6 when the window increases from 4 to 12?


C16 Use the data in APPROVAL to answer the following questions. See also Computer Exercise C14 in
Chapter 11.

(i) Estimate the equation

$approve_t = \beta_0 + \beta_1 lcpifood_t + \beta_2 lrgasprice_t + \beta_3 unemploy_t + \beta_4 sep11_t + \beta_5 iraqinvade_t + u_t$

using first differencing and test the errors in the first-differenced (FD) equation for AR(1) serial
correlation. In particular, let $\hat{e}_t$ be the OLS residuals in the FD estimation and regress $\hat{e}_t$ on $\hat{e}_{t-1}$;
report the p-value of the test. What is the estimate of $\rho$?

(ii) Estimate the FD equation using Prais-Winsten. How does the estimate of $\beta_2$ compare with the
OLS estimate on the FD equation? What about its statistical significance?

(iii) Return to estimating the FD equation by OLS. Now obtain the Newey-West standard errors
using lags of one, four, and eight. Discuss the statistical significance of the estimate of $\beta_2$ using
each of the three standard errors.


We now turn to some more specialized topics that are not usually covered in a one-term,
introductory course. Some of these topics require a few more mathematical skills than
the multiple regression analysis did in Parts 1 and 2. In Chapter 13, we show how to
apply multiple regression to independently pooled cross sections. The issues raised are very similar 
to standard cross-sectional analysis, except that we can study how relationships change over time 
by including time dummy variables. Pooled cross sections can be used very effectively for policy 
analysis, where a policy is assigned at a group level and we have not only at least one control group, 
but also periods before and after the intervention. We also illustrate how panel data sets can be ana- 
lyzed in a regression framework. Chapter 14 covers more advanced panel data methods that are 
nevertheless used routinely in applied work. 

Chapters 15 and 16 investigate the problem of endogenous explanatory variables. In Chapter 15, 
we introduce the method of instrumental variables as a way of solving the omitted variable problem 
as well as the measurement error problem. The method of two-stage least squares is used quite 
often in empirical economics and is indispensable for estimating simultaneous equation models, a 
topic we turn to in Chapter 16. 

Chapter 17 covers some fairly advanced topics that are typically used in cross-sectional analy- 
sis, including models for limited dependent variables and methods for correcting sample selection 
bias. Chapter 18 heads in a different direction by covering some recent advances in time series 
econometrics that have proven to be useful in estimating dynamic relationships. 

Chapter 19 should be helpful to students who must write either a term paper or some other 
paper in the applied social sciences. The chapter offers suggestions for how to select a topic, col- 
lect and analyze the data, and write the paper. 


Pooling Cross Sections across Time: Simple Panel Data Methods
Until now, we have covered multiple regression analysis using pure cross-sectional or pure time
series data. Although these two cases arise often in applications, data sets that have both
cross-sectional and time series dimensions are being used more and more often in empirical research.
Multiple regression methods can still be used on such data sets. In fact, data with cross-sectional and 
time series aspects can often shed light on important policy questions. We will see several examples 
in this chapter. 

We will analyze two kinds of data sets in this chapter. An independently pooled cross section 
is obtained by sampling randomly from a large population at different points in time (usually, but 
not necessarily, different years). For instance, in each year, we can draw a random sample on hourly 
wages, education, experience, and so on, from the population of working people in the United States. 
Or, in every other year, we draw a random sample on the selling price, square footage, number of 
bathrooms, and so on, of houses sold in a particular metropolitan area. From a statistical standpoint, 
these data sets have an important feature: they consist of independently sampled observations. This 
was also a key aspect in our analysis of cross-sectional data: among other things, it rules out correla- 
tion in the error terms across different observations. 

An independently pooled cross section differs from a single random sample in that sampling 
from the population at different points in time likely leads to observations that are not identically 
distributed. For example, distributions of wages and education have changed over time in most countries.
As we will see, this is easy to deal with in practice by allowing the intercept in a multiple regression
model, and in some cases the slopes, to change over time. We cover such models in Section 13-1. In
Section 13-2, we discuss how pooling cross sections over time can be used to evaluate policy changes.

A panel data set, while having both a cross-sectional and a time series dimension, differs in some 
important respects from an independently pooled cross section. To collect panel data—sometimes 
called longitudinal data—we follow (or attempt to follow) the same individuals, families, firms, cit- 
ies, states, or whatever, across time. For example, a panel data set on individual wages, hours, educa- 
tion, and other factors is collected by randomly selecting people from a population at a given point in 
time. Then, these same people are reinterviewed at several subsequent points in time. This gives us 
data on wages, hours, education, and so on, for the same group of people in different years. 

Panel data sets are fairly easy to collect for school districts, cities, counties, states, and countries, 
and policy analysis is greatly enhanced by using panel data sets; we will see some examples in the 
following discussion. For the econometric analysis of panel data, we cannot assume that the obser- 
vations are independently distributed across time. For example, unobserved factors (such as ability) 
that affect someone’s wage in 1990 will also affect that person’s wage in 1991; unobserved factors 
that affect a city’s crime rate in 1985 will also affect that city’s crime rate in 1990. For this reason, 
special models and methods have been developed to analyze panel data. In Sections 13-3, 13-4, and 
13-5, we describe the straightforward method of differencing to remove time-constant, unobserved 
attributes of the units being studied. Because panel data methods are somewhat more advanced, we 
will rely mostly on intuition in describing the statistical properties of the estimation procedures,
leaving detailed assumptions to the chapter appendix. We follow the same strategy in Chapter 14, which
covers more complicated panel data methods.


13-1 Pooling Independent Cross Sections across Time 


Many surveys of individuals, families, and firms are repeated at regular intervals, often each year. An 
example is the Current Population Survey (or CPS), which randomly samples households each year. 
(See, for example, CPS78_85, which contains data from the 1978 and 1985 CPS.) If a random sample 
is drawn at each time period, pooling the resulting random samples gives us an independently pooled 
cross section. 

One reason for using independently pooled cross sections is to increase the sample size. By pool- 
ing random samples drawn from the same population, but at different points in time, we can get more 
precise estimators and test statistics with more power. Pooling is helpful in this regard only insofar as 
the relationship between the dependent variable and at least some of the independent variables remains
constant over time.

As mentioned in the introduction, using pooled cross sections raises only minor statistical com- 
plications. Typically, to reflect the fact that the population may have different distributions in different 
time periods, we allow the intercept to differ across periods, usually years. This is easily accom- 
plished by including dummy variables for all but one year, where the earliest year in the sample is 
usually chosen as the base year. It is also possible that the error variance changes over time, some- 
thing we discuss later. 
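
As a rough illustration of the dummy-variable approach just described, the following sketch builds year dummies for all but the base year and estimates a pooled regression by OLS. The data are simulated, and the variable names, sample sizes, and parameter values are assumptions made only for this illustration; they do not come from any data set used in the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical pooled sample: three years, 200 independent draws per year.
years = np.repeat([1978, 1980, 1982], 200)
educ = rng.normal(12.0, 2.0, size=years.size)

# Assumed data-generating process: common slope, year-specific intercept shifts.
shift = {1978: 0.0, 1980: 0.3, 1982: -0.5}
year_effect = np.array([shift[int(t)] for t in years])
y = 1.0 + year_effect + 0.1 * educ + rng.normal(0.0, 0.5, size=years.size)

# Dummies for every year except the base year (the earliest year, 1978).
base = years.min()
other_years = [t for t in np.unique(years) if t != base]
dummies = np.column_stack([(years == t).astype(float) for t in other_years])

# Design matrix: intercept, year dummies, educ.
X = np.column_stack([np.ones(years.size), dummies, educ])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

intercept, d1980, d1982, slope = coef
print(f"base intercept={intercept:.3f}, y1980={d1980:.3f}, "
      f"y1982={d1982:.3f}, educ={slope:.3f}")
```

Each dummy coefficient estimates the shift in the intercept relative to the base year, while the slope on educ is restricted to be the same in every year.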

Sometimes, the pattern of coefficients on the year dummy variables is itself of interest. For exam- 
ple, a demographer may be interested in the following question: After controlling for education, has 
the pattern of fertility among women over age 35 changed between 1972 and 1984? The following 
example illustrates how this question is simply answered by using multiple regression analysis with 
year dummy variables. 
EXAMPLE 13.1 Women’s Fertility over Time


The data set in FERTIL1, which is similar to that used by Sander (1992), comes from the National 
Opinion Research Center’s General Social Survey for the even years from 1972 to 1984, inclusively. 
We use these data to estimate a model explaining the total number of kids born to a woman (kids). 

One question of interest is: After controlling for other observable factors, what has happened to 
fertility rates over time? The factors we control for are years of education, age, race, region of the 
country where living at age 16, and living environment at age 16. The estimates are given in Table 13.1. 

The base year is 1972. The coefficients on the year dummy variables show a sharp drop in fertility
in the early 1980s. For example, the coefficient on y82 implies that, holding education, age, and
other factors fixed, a woman had on average .52 fewer children, or about one-half a child, in 1982 than
in 1972. This is a very large drop: holding educ, age, and the other factors fixed, 100 women in 1982
are predicted to have about 52 fewer children than 100 comparable women in 1972. Because we are
controlling for education, this drop is separate from the decline in fertility that is due to the increase in
average education levels. (The average years of education are 12.2 for 1972 and 13.3 for 1984.) The
coefficients on y82 and y84 represent drops in fertility for reasons that are not captured in the explanatory
variables.

Given that the 1982 and 1984 year dummies are individually quite significant, it is not surprising 
that as a group the year dummies are jointly very significant: the R-squared for the regression without 
the year dummies is .1019, and this leads to $F_{6,1111} = 5.87$ and p-value ≈ 0.
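
The F statistic quoted above can be checked with the R-squared form of the F test, using the two R-squareds and the sample size reported for Table 13.1. The short function below is only a sketch; its name and argument names are our own choices, not notation from the text.

```python
def f_stat_from_r2(r2_ur, r2_r, q, df_ur):
    """R-squared form of the F statistic for q exclusion restrictions."""
    return ((r2_ur - r2_r) / q) / ((1.0 - r2_ur) / df_ur)


n, k = 1129, 17   # observations and slope coefficients in Table 13.1
q = 6             # year dummies y74, y76, ..., y84
F = f_stat_from_r2(r2_ur=0.1295, r2_r=0.1019, q=q, df_ur=n - k - 1)
print(round(F, 2))
```

The denominator degrees of freedom are n - k - 1 = 1,111, which matches the subscripts on the F statistic.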


TABLE 13.1 Determinants of Women’s Fertility

Dependent Variable: kids

Independent Variables    Coefficients    Standard Errors
educ                     -.128           .018
age                       .532           .138
age²                     -.0058          .0016
black                    1.076           .174
east                      .217           .133
northcen                  .363           .121
west                      .198           .167
farm                     -.053           .147
othrural                 -.163           .175
town                      .084           .124
smcity                    .212           .160
y74                       .268           .173
y76                      -.097           .179
y78                      -.069           .182
y80                      -.071           .183
y82                      -.522           .172
y84                      -.545           .175
constant                -7.742          3.052

n = 1,129
R² = .1295
Adjusted R² = .1162

Women with more education have fewer children, and the estimate is very statistically significant.
Other things being equal, 100 women with a college education will have about 51 fewer children
on average than 100 women with only a high school education: .128(4) = .512. Age has a diminish- 
ing effect on fertility. (The turning point in the quadratic is at about age = 46, by which time most 
women have finished having children.) 
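
The turning point mentioned above follows from the usual formula for a quadratic, applied to the coefficients on age and age² in Table 13.1. As a worked check (the symbol $age^{*}$ is our own shorthand for the turning point):

```latex
\[
age^{*} \;=\; \left| \frac{\hat{\beta}_{age}}{2\,\hat{\beta}_{age^2}} \right|
        \;=\; \frac{.532}{2(.0058)} \;\approx\; 45.9 ,
\]
```

which rounds to the age of about 46 quoted in the text.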

The model estimated in Table 13.1 assumes that the effect of each explanatory variable, particularly
education, has remained constant. This may or may not be true; you will be asked to explore this
issue in Computer Exercise C1.

Finally, there may be heteroskedasticity in the error term underlying the estimated equation. This 
can be dealt with using the methods in Chapter 8. There is one interesting difference here: now, the 
error variance may change over time even if it does not change with the values of educ, age, black, 
and so on. The heteroskedasticity-robust standard errors and test statistics are nevertheless valid. The 
Breusch-Pagan test would be obtained by regressing the squared OLS residuals on all of the independent
variables in Table 13.1, including the year dummies. (For the special case of the White statistic,
the fitted values $\widehat{kids}$ and the squared fitted values are used as the independent variables, as
always.) A weighted least squares procedure should account for variances that possibly change over
time. In the procedure discussed in Section 8-4, year dummies would be included in equation (8.32). 


GOING FURTHER 13.1
In reading Table 13.1, someone claims that, if everything else is equal in the table, a
black woman is expected to have one more child than a nonblack woman. Do you agree
with this claim?

We can also interact a year dummy variable with key explanatory variables to see if the effect of
that variable has changed over a certain time period. The next example examines how the return to
education and the gender gap have changed from 1978 to 1985.

EXAMPLE 13.2 Changes in the Return to Education and the Gender Wage Gap


A log(wage) equation (where wage is hourly wage) pooled across the years 1978 (the base year) and 
1985 is 


$\log(wage) = \beta_0 + \delta_0 y85 + \beta_1 educ + \delta_1 y85 \cdot educ + \beta_2 exper$
$\qquad + \beta_3 exper^2 + \beta_4 union + \beta_5 female + \delta_5 y85 \cdot female + u,$ [13.1]


where most explanatory variables should by now be familiar. The variable union is a dummy vari- 
able equal to one if the person belongs to a union, and zero otherwise. The variable y85 is a dummy 
variable equal to one if the observation comes from 1985 and zero if it comes from 1978. There are 
550 people in the sample in 1978 and a different set of 534 people in 1985. 

The intercept for 1978 is $\beta_0$, and the intercept for 1985 is $\beta_0 + \delta_0$. The return to education in
1978 is $\beta_1$, and the return to education in 1985 is $\beta_1 + \delta_1$. Therefore, $\delta_1$ measures how the return to
another year of education has changed over the seven-year period. Finally, in 1978, the log(wage)
differential between women and men is $\beta_5$; the differential in 1985 is $\beta_5 + \delta_5$. Thus, we can test the null
hypothesis that nothing has happened to the gender differential over this seven-year period by testing
$H_0: \delta_5 = 0$. The alternative that the gender differential has been reduced is $H_1: \delta_5 > 0$. For simplicity,
we have assumed that experience and union membership have the same effect on wages in both time
periods.

Before we present the estimates, there is one other issue we need to address—namely, hourly 
wage here is in nominal (or current) dollars. Because nominal wages grow simply due to inflation, 
we are really interested in the effect of each explanatory variable on real wages. Suppose that we set- 
tle on measuring wages in 1978 dollars. This requires deflating 1985 wages to 1978 dollars. (Using 
the Consumer Price Index from the 1997 Economic Report of the President, the deflation factor is
107.6/65.2 ≈ 1.65.) Although we can easily divide each 1985 wage by 1.65, it turns out that this is
not necessary, provided a 1985 year dummy is included in the regression and log(wage) (as opposed
to wage) is used as the dependent variable. Using real or nominal wage in a logarithmic functional
form only affects the coefficient on the year dummy, y85. To see this, let P85 denote the deflation
factor for 1985 wages (1.65, if we use the CPI). Then, the log of the real wage for each person i in the
1985 sample is


$\log(wage_i/P85) = \log(wage_i) - \log(P85).$


Now, while $wage_i$ differs across people, P85 does not. Therefore, log(P85) will be absorbed into the
intercept for 1985. (This conclusion would change if, for example, we used a different price index for 
people living in different parts of the country.) The bottom line is that, for studying how the return to 
education or the gender gap has changed, we do not need to turn nominal wages into real wages in 
equation (13.1). Computer Exercise C2 asks you to verify this for the current example. 

If we forget to allow different intercepts in 1978 and 1985, the use of nominal wages can produce 
seriously misleading results. If we use wage rather than log(wage) as the dependent variable, it is 
important to use the real wage and to include a year dummy. 

The previous discussion generally holds when using dollar values for either the dependent or 
independent variables. Provided the dollar amounts appear in logarithmic form and dummy variables 
are used for all time periods (except, of course, the base period), the use of aggregate price deflators 
will only affect the intercepts; none of the slope estimates will change. 
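
The claim that deflating only changes the year intercept can be checked numerically. The sketch below uses simulated log wages; the sample size and coefficient values are assumptions chosen purely for illustration, while the deflation factor of 1.65 is the one quoted above. It regresses nominal and real log wages on the same regressors and compares the two coefficient vectors.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
y85 = (np.arange(n) >= n // 2).astype(float)   # second half plays the role of 1985
educ = rng.normal(12.0, 2.5, size=n)

P85 = 1.65  # deflation factor quoted in the text
# Assumed process for nominal log wages (illustrative numbers only).
log_nominal = (0.5 + 0.6 * y85 + 0.08 * educ + 0.02 * y85 * educ
               + rng.normal(0.0, 0.4, size=n))
# Deflate only the 1985 observations: log(wage/P85) = log(wage) - log(P85).
log_real = log_nominal - y85 * np.log(P85)

# Same regressors in both regressions: intercept, y85, educ, y85*educ.
X = np.column_stack([np.ones(n), y85, educ, y85 * educ])
b_nom, *_ = np.linalg.lstsq(X, log_nominal, rcond=None)
b_real, *_ = np.linalg.lstsq(X, log_real, rcond=None)

diff = b_nom - b_real
print("nominal minus real coefficients:", diff)
print("log(P85):", np.log(P85))
```

Only the coefficient on y85 differs across the two regressions, and it differs by exactly log(P85); the slopes on educ and the interaction are unchanged.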

Now, we use the data in CPS78_85 to estimate the equation: 


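$\widehat{\log(wage)} = \underset{(.093)}{.459} + \underset{(.124)}{.118}\, y85 + \underset{(.0067)}{.0747}\, educ + \underset{(.0094)}{.0185}\, y85 \cdot educ$
$\qquad + \underset{(.0036)}{.0296}\, exper - \underset{(.00008)}{.00040}\, exper^2 + \underset{(.030)}{.202}\, union$
$\qquad - \underset{(.037)}{.317}\, female + \underset{(.051)}{.085}\, y85 \cdot female$ [13.2]

$n = 1{,}084,\ R^2 = .426,\ \bar{R}^2 = .422.$
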
The return to education in 1978 is estimated to be about 7.5%; the return to education in 1985 is
about 1.85 percentage points higher, or about 9.35%. Because the t statistic on the interaction term is
.0185/.0094 ≈ 1.97, the difference in the return to education is statistically significant at the 5% level
against a two-sided alternative.

What about the gender gap? In 1978, other things being equal, a woman earned about
31.7% less than a man (27.2% is the more accurate estimate). In 1985, the gap in log(wage) is
-.317 + .085 = -.232. Therefore, the gender gap appears to have fallen from 1978 to 1985 by
about 8.5 percentage points. The t statistic on the interaction term is about 1.67, which means it is
significant at the 5% level against the positive one-sided alternative.
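
The "more accurate" figure of 27.2% comes from converting a log difference into an exact percentage difference, as for any dummy variable in a log equation. As a worked check using the estimates in equation (13.2) (the 1985 figure is our own calculation from those estimates, not a number reported in the text):

```latex
\[
100\,[\exp(-.317) - 1] \approx -27.2\%, \qquad
100\,[\exp(-.317 + .085) - 1] = 100\,[\exp(-.232) - 1] \approx -20.7\% .
\]
```

Measured this way, the estimated gap narrows by roughly 6.5 percentage points, somewhat less than the 8.5 points implied by the log approximation.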


What happens if we interact all independent variables with y85 in equation (13.2)? This is identi- 
cal to estimating two separate equations, one for 1978 and one for 1985. Sometimes, this is desirable. 
For example, in Chapter 7, we discussed a study by Krueger (1993), in which he estimated the return 
to using a computer on the job. Krueger estimates two separate equations, one using the 1984 CPS 
and the other using the 1989 CPS. By comparing how the return to education changes across time 
and whether or not computer usage is controlled for, he estimates that one-third to one-half of the 
observed increase in the return to education over the five-year period can be attributed to increased 
computer usage. [See Tables VIII and IX in Krueger (1993).] 
13-1a The Chow Test for Structural Change across Time


In Chapter 7, we discussed how the Chow test—which is simply an F test—can be used to determine 
whether a multiple regression function differs across two groups. We can apply that test to two 
different time periods as well. One form of the test obtains the sum of squared residuals from the 
pooled estimation as the restricted SSR. The unrestricted SSR is the sum of the SSRs for the two 
separately estimated time periods. The mechanics of computing the statistic are exactly as they were 
in Section 7-4. A heteroskedasticity-robust version is also available (see Section 8-2). 

Example 13.2 suggests another way to compute the Chow test for two time periods by interact- 
ing each variable with a year dummy for one of the two years and testing for joint significance of 
the year dummy and all of the interaction terms. Because the intercept in a regression model often 
changes over time (due to, say, inflation in the housing price example), this full-blown Chow test can 
detect such changes. It is usually more interesting to allow for an intercept difference and then to test 
whether certain slope coefficients change over time (as we did in Example 13.2). 

A Chow test can also be computed for more than two time periods. Just as in the two-period case, 
it is usually more interesting to allow the intercepts to change over time and then test whether the 
slope coefficients have changed over time. We can test the constancy of slope coefficients generally 
by interacting all of the time-period dummies (except that defining the base group) with one, several, 
or all of the explanatory variables and test the joint significance of the interaction terms. Computer 
Exercises C1 and C2 are examples. For many time periods and explanatory variables, constructing a 
full set of interactions can be tedious. Alternatively, we can adapt the approach described in part (vi) 
of Computer Exercise C11 in Chapter 7. First, estimate the restricted model by doing a pooled
regression allowing for different time intercepts; this gives $SSR_r$. Then, run a regression for each of
the, say, T time periods and obtain the sum of squared residuals for each time period. The unrestricted
sum of squared residuals is obtained as $SSR_{ur} = SSR_1 + SSR_2 + \cdots + SSR_T$. If there are k explanatory
variables (not including the intercept or the time dummies) with T time periods, then we are testing
$(T - 1)k$ restrictions, and there are $T + Tk$ parameters estimated in the unrestricted model. So, if
$n = n_1 + n_2 + \cdots + n_T$ is the total number of observations, then the df of the F test are $(T - 1)k$ and
$n - T - Tk$. We compute the F statistic as usual: $[(SSR_r - SSR_{ur})/SSR_{ur}][(n - T - Tk)/((T - 1)k)]$.
Unfortunately, as with any F test based on sums of squared residuals or R-squareds, this test
is not robust to heteroskedasticity (including changing variances across time). To obtain a
heteroskedasticity-robust test, we must construct the interaction terms and do a pooled regression.
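
The SSR-based procedure just described can be sketched in a few lines. The function names and the simulated data below are hypothetical and serve only to show the mechanics: the restricted model has period-specific intercepts and common slopes, and the unrestricted SSR is the sum of the SSRs from separate regressions for each period. As noted above, this version is not robust to heteroskedasticity.

```python
import numpy as np


def ssr(X, y):
    """Sum of squared OLS residuals from regressing y on the columns of X."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ b
    return float(resid @ resid)


def chow_slopes(y, X, period):
    """F test that the k slopes are constant across T periods,
    allowing a separate intercept in each period."""
    periods = np.unique(period)
    T = periods.size
    n, k = X.shape
    dummies = np.column_stack([(period == t).astype(float) for t in periods])
    # Restricted: a full set of T period intercepts, common slopes.
    ssr_r = ssr(np.column_stack([dummies, X]), y)
    # Unrestricted: one regression per period, SSRs added up.
    ssr_ur = 0.0
    for t in periods:
        mask = period == t
        Xt = np.column_stack([np.ones(mask.sum()), X[mask]])
        ssr_ur += ssr(Xt, y[mask])
    df_num, df_den = (T - 1) * k, n - T - T * k
    F = ((ssr_r - ssr_ur) / ssr_ur) * (df_den / df_num)
    return F, df_num, df_den


# Simulated data (hypothetical): 3 periods, 2 regressors, slopes held constant.
rng = np.random.default_rng(2)
period = np.repeat([0, 1, 2], 100)
X = rng.normal(size=(300, 2))
y = 1.0 + 0.5 * period + X @ np.array([1.0, -2.0]) + rng.normal(size=300)

F, df_num, df_den = chow_slopes(y, X, period)
print(F, df_num, df_den)
```

With T = 3 and k = 2, the test has (T - 1)k = 4 numerator and n - T - Tk = 291 denominator degrees of freedom.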


13-2 Policy Analysis with Pooled Cross Sections 


Pooled cross sections can be very useful for evaluating the impact of a certain event or policy. The fol- 
lowing example of an event study shows how two cross-sectional data sets, collected before and after 
the occurrence of an event, can be used to determine the effect on economic outcomes. 


EXAMPLE 13.3 Effect of a Garbage Incinerator’s Location on Housing Prices


Kiel and McClain (1995) studied the effect that a new garbage incinerator had on housing values in 
North Andover, Massachusetts. They used many years of data and a fairly complicated econometric 
analysis. We will use two years of data and some simplified models, but our analysis is similar. 

The rumor that a new incinerator would be built in North Andover began after 1978, and 
construction began in 1981. The incinerator was expected to be in operation soon after the start of 
construction; the incinerator actually began operating in 1985. We will use data on prices of houses 
that sold in 1978 and another sample on those that sold in 1981. The hypothesis is that the price of 
houses located near the incinerator would fall relative to the price of more distant houses. 
For illustration, we define a house to be near the incinerator if it is within three miles. [In
Computer Exercise C3, you are instead asked to use the actual distance from the house to the incin- 
erator, as in Kiel and McClain (1995).] We will start by looking at the dollar effect on housing prices. 
This requires us to measure price in constant dollars. We measure all housing prices in 1978 dollars, 
using the Boston housing price index. Let rprice denote the house price in real terms. 

A naive analyst would use only the 1981 data and estimate a very simple model: 


$rprice = \gamma_0 + \gamma_1 nearinc + u,$ [13.3]


where nearinc is a binary variable equal to one if the house is near the incinerator, and zero otherwise. 
Estimating this equation using the data in KIELMC gives 


$\widehat{rprice} = \underset{(3{,}093.0)}{101{,}307.5} - \underset{(5{,}827.71)}{30{,}688.27}\, nearinc$ [13.4]
$n = 142,\ R^2 = .165.$


Because this is a simple regression on a single dummy variable, the intercept is the average selling 
price for homes not near the incinerator, and the coefficient on nearinc is the difference in the average 
selling price between homes near the incinerator and those that are not. The estimate shows that the 
average selling price for the former group was $30,688.27 less than for the latter group. The t statistic
is greater than five in absolute value, so we can strongly reject the hypothesis that the average values
for homes near and far from the incinerator are the same.

Unfortunately, equation (13.4) does not imply that the siting of the incinerator is causing the 
lower housing values. In fact, if we run the same regression for 1978 (before the incinerator was even 
rumored), we obtain 


$\widehat{rprice} = \underset{(2{,}653.79)}{82{,}517.23} - \underset{(4{,}744.59)}{18{,}824.37}\, nearinc$ [13.5]
$n = 179,\ R^2 = .082.$


Therefore, even before there was any talk of an incinerator, the average value of a home near the site 
was $18,824.37 less than the average value of a home not near the site ($82,517.23); the difference 
is statistically significant, as well. This is consistent with the view that the incinerator was built in an 
area with lower housing values. 

How, then, can we tell whether building a new incinerator depresses housing values? The key is 
to look at how the coefficient on nearinc changed between 1978 and 1981. The difference in aver- 
age housing value was much larger in 1981 than in 1978 ($30,688.27 versus $18,824.37), even as a 
percentage of the average value of homes not near the incinerator site. The difference in the two coef- 
ficients on nearinc is 


$\hat{\delta}_1 = -30{,}688.27 - (-18{,}824.37) = -11{,}863.9.$


This is our estimate of the effect of the incinerator on values of homes near the incinerator site. In
empirical economics, $\hat{\delta}_1$ has become known as the difference-in-differences (DD or DID) estimator
because it can be expressed as


$\hat{\delta}_1 = (\overline{rprice}_{81,nr} - \overline{rprice}_{81,fr}) - (\overline{rprice}_{78,nr} - \overline{rprice}_{78,fr}),$ [13.6]


where nr stands for “near the incinerator site” and fr stands for “farther away from the site.” In other 
words, $\hat{\delta}_1$ is the difference over time in the average difference of housing prices in the two locations.

To test whether $\hat{\delta}_1$ is statistically different from zero, we need to find its standard error by using a
regression analysis. In fact, $\hat{\delta}_1$ can be obtained by estimating


$rprice = \beta_0 + \delta_0 y81 + \beta_1 nearinc + \delta_1 y81 \cdot nearinc + u,$ [13.7]
using the data pooled over both years. The intercept, $\beta_0$, is the average price of a home not near the
incinerator in 1978. The parameter $\delta_0$ captures changes in all housing values in North Andover from 1978 to
1981. [A comparison of equations (13.4) and (13.5) shows that housing values in North Andover, relative
to the Boston housing price index, increased sharply over this period.] The coefficient on nearinc,
$\beta_1$, measures the location effect that is not due to the presence of the incinerator: as we saw in equation
(13.5), even in 1978, homes near the incinerator site sold for less than homes farther away from the site.

The parameter of interest is on the interaction term y81·nearinc: $\delta_1$ measures the decline in housing
values due to the new incinerator, provided we assume that houses both near and far from the site
did not appreciate at different rates for other reasons.

The estimates of equation (13.7) are given in column (1) of Table 13.2. The only number we
could not obtain from equations (13.4) and (13.5) is the standard error of $\hat{\delta}_1$. The t statistic on $\hat{\delta}_1$ is
about −1.59, which is marginally significant against a one-sided alternative (p-value ≈ .057).
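Because equation (13.7) without controls contains only two dummies and their interaction, its four coefficients determine the four cell averages exactly. The following minimal sketch uses the column (1) estimates from Table 13.2 to recover those averages and to confirm that differencing them either way returns the coefficient on y81·nearinc. The variable names are illustrative and not part of the original analysis.

```python
# Column (1) estimates from Table 13.2 (no controls).
b0 = 82_517.23    # constant: average price far from the site, 1978
d0 = 18_790.29    # coefficient on y81
b1 = -18_824.37   # coefficient on nearinc
d1 = -11_863.90   # coefficient on y81*nearinc

# Cell averages implied by the saturated two-dummy model.
far_78 = b0
far_81 = b0 + d0
near_78 = b0 + b1
near_81 = b0 + d0 + b1 + d1

# Equation (13.6): difference across locations in each year, then over time.
dd_by_location = (near_81 - far_81) - (near_78 - far_78)
# Same estimate: change over time for each location, then across locations.
dd_by_time = (near_81 - near_78) - (far_81 - far_78)

print(round(dd_by_location, 2), round(dd_by_time, 2))
```

Both orderings give the same number because the four averages enter the two expressions with identical signs.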

Kiel and McClain (1995) included various housing characteristics in their analysis of the incinera- 
tor siting. There are two good reasons for doing this. First, the kinds of homes selling near the incinera- 
tor in 1981 might have been systematically different than those selling near the incinerator in 1978; if 
so, it can be important to control for such characteristics. Second, even if the relevant house charac- 
teristics did not change, including them can greatly reduce the error variance, which can then shrink 
the standard error of $\hat{\delta}_1$. (See Section 6-3 for discussion.) In column (2), we control for the age of the
houses, using a quadratic. This substantially increases the R-squared (by reducing the residual variance).
The coefficient on y81·nearinc is now much larger in magnitude, and its standard error is lower.

In addition to the age variables in column (2), column (3) controls for distance to the inter- 
state in feet (intst), land area in feet (land), house area in feet (area), number of rooms (rooms), 
and number of baths (baths). This produces an estimate on y81·nearinc closer to that without
any controls, but it yields a much smaller standard error: the t statistic for $\hat{\delta}_1$ is about −2.84.
Therefore, we find a much more significant effect in column (3) than in column (1). The column (3) 
estimates are preferred because they control for the most factors and have the smallest standard 
errors (except in the constant, which is not important here). The fact that nearinc has a much smaller 
coefficient and is insignificant in column (3) indicates that the characteristics included in column (3) 
largely capture the housing characteristics that are most important for determining housing prices. 

For the purpose of introducing the method, we used the level of real housing prices in Table 13.2. 
It makes more sense to use log(price) [or log(rprice)] in the analysis in order to get an approximate 
percentage effect. The basic model becomes 


$$\log(price) = \beta_0 + \delta_0\, y81 + \beta_1\, nearinc + \delta_1\, y81 \cdot nearinc + u.$$ [13.8]


TABLE 13.2 Effects of Incinerator Location on Housing Prices

Dependent Variable: rprice

Independent Variable        (1)             (2)             (3)
constant                 82,517.23       89,116.54       13,807.67
                         (2,726.91)      (2,406.05)     (11,166.59)
y81                      18,790.29       21,321.04       13,928.48
                         (4,050.07)      (3,443.63)      (2,798.75)
nearinc                 −18,824.37        9,397.94        3,780.34
                         (4,875.32)      (4,812.22)      (4,453.42)
y81·nearinc             −11,863.90      −21,920.27      −14,177.93
                         (7,456.65)      (6,359.75)      (4,987.27)
Other controls               No          age, age²       Full Set
Observations                321             321             321
R-squared                  .174            .414            .660

(Standard errors are in parentheses below the estimates.)
Now, $100 \cdot \delta_1$ is the approximate percentage reduction in housing value due to the incinerator. [Just as
in Example 13.2, using log(price) versus log(rprice) only affects the coefficient on y81.] Using the
same 321 pooled observations gives

$$\widehat{\log(price)} = \underset{(.31)}{11.29} + \underset{(.045)}{.457}\, y81 - \underset{(.055)}{.340}\, nearinc - \underset{(.083)}{.063}\, y81 \cdot nearinc$$ [13.9]
$$n = 321,\ R^2 = .409.$$


The coefficient on the interaction term implies that, because of the new incinerator, houses near the 
incinerator lost about 6.3% in value. However, this estimate is not statistically different from zero. 
But when we use a full set of controls, as in column (3) of Table 13.2 (but with intst, land, and area 
appearing in logarithmic form), the coefficient on y81·nearinc becomes −.132 with a t statistic of
about −2.53. Again, controlling for other factors turns out to be important. Using the logarithmic
form, we estimate that houses near the incinerator were devalued by about 13.2%. 
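The figures 6.3% and 13.2% come from multiplying the coefficient by 100, which is an approximation. As a hedged illustration, the sketch below compares that approximation with the exact percentage change implied by a dummy-variable coefficient in a log equation, 100·[exp(coefficient) − 1]. The function name is hypothetical, and the exact formula is a standard result that the passage above does not itself apply.

```python
import math


def exact_pct_change(coef: float) -> float:
    """Exact percentage change in price implied by a dummy's coefficient
    in a log(price) equation: 100 * (exp(coef) - 1)."""
    return 100.0 * math.expm1(coef)


# Interaction coefficients reported in the text for the two log specifications.
for label, coef in [("no controls", -0.063), ("full controls", -0.132)]:
    approx = 100.0 * coef
    exact = exact_pct_change(coef)
    print(f"{label}: approximate {approx:.1f}%, exact {exact:.1f}%")
```

For coefficients this small the two measures are close (roughly 6.1% and 12.4% for the exact versions), so the approximate interpretation used in the text is reasonable here.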


The methodology used in the previous example has numerous applications, especially when 
the data arise from a natural experiment (or a quasi-experiment). A natural experiment occurs 
when some exogenous event—often a change in government policy—changes the environment 
in which individuals, families, firms, or cities operate. A natural experiment always has a control 
group, which is not affected by the policy change, and a treatment group, which is thought to be 
affected by the policy change. Unlike a true experiment, in which treatment and control groups are 
randomly and explicitly chosen, the control and treatment groups in natural experiments arise from 
the particular policy change. To control for systematic differences between the control and treat- 
ment groups, we need two years of data, one before the policy change and one after the change. 
Thus, our sample is usefully broken down into four groups: the control group before the change, 
the control group after the change, the treatment group before the change, and the treatment group 
after the change. 

Call C the control group and T the treatment group, letting dT equal unity for those in the treatment
group T, and zero otherwise. Then, letting d2 denote a dummy variable for the second (post-policy
change) time period, the equation of interest is


$$y = \beta_0 + \delta_0\, d2 + \beta_1\, dT + \delta_1\, d2 \cdot dT + \text{other factors},$$ [13.10]


where y is the outcome variable of interest. As in Example 13.3, $\delta_1$ measures the effect of the policy.
Without other factors in the regression, $\hat{\delta}_1$ will be the difference-in-differences estimator:


$$\hat{\delta}_1 = (\bar{y}_{2,T} - \bar{y}_{2,C}) - (\bar{y}_{1,T} - \bar{y}_{1,C}),$$ [13.11]


where the bar denotes average, the first subscript denotes the year, and the second subscript denotes 
the group. By simple rearrangement of (13.11), we can also write 


$$\hat{\delta}_1 = (\bar{y}_{2,T} - \bar{y}_{1,T}) - (\bar{y}_{2,C} - \bar{y}_{1,C}),$$ [13.12]


which provides a different interpretation of the DD estimator. The first term, $\bar{y}_{2,T} - \bar{y}_{1,T}$, is the
difference in means over time for the treated group. This quantity would be a good estimator of the policy
effect only if we can assume no external factors changed across the two time periods. (It is a
before-after estimator applied to just the treated group.) To guard against this possibility, we compute the
same trend in averages for the control group, $\bar{y}_{2,C} - \bar{y}_{1,C}$. By subtracting this from $\bar{y}_{2,T} - \bar{y}_{1,T}$ we hope
to get a good estimator of the causal impact of the program or intervention.

The standard difference-in-differences setup is shown in Table 13.3. Like equations (13.11) and 
(13.12), Table 13.3 shows that the parameter 6, can be estimated in two ways: (1) Compute the differ- 
ences in averages between the treatment and control groups in each time period, and then difference 
the results over time, as in equation (13.11); (2) Compute the change in averages over time for each of 
the treatment and control groups, and then difference these changes, as in equation (13.12). Naturally,
the estimate $\hat{\delta}_1$, which can also be written as $\hat{\delta}_{DD}$, does not depend on how we do the differencing,
but it is helpful to have the two different interpretations.

TABLE 13.3 Illustration of the Difference-in-Differences Estimator

                        Before         After                     After − Before
Control                 β0             β0 + δ0                   δ0
Treatment               β0 + β1        β0 + δ0 + β1 + δ1         δ0 + δ1
Treatment − Control     β1             β1 + δ1                   δ1
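The equivalence between the regression in (13.10), estimated without other factors, and the two differencing formulas can be checked numerically. The sketch below uses made-up outcomes for the four cells and a small hand-written OLS routine based on the normal equations; the data and the helper names `solve`, `ols`, and `mean` are all assumptions for illustration only.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def ols(X, y):
    """OLS coefficients from the normal equations (X'X) b = X'y."""
    k = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)


def mean(values):
    return sum(values) / len(values)


# Hypothetical outcomes for the four cells, keyed by (d2, dT).
cells = {
    (0, 0): [10.0, 12.0, 11.0],
    (1, 0): [13.0, 15.0, 14.0, 14.0],
    (0, 1): [8.0, 9.0],
    (1, 1): [9.0, 11.0, 10.0],
}

# Regression form: y on a constant, d2, dT, and d2*dT.
X, y = [], []
for (d2, dT), outcomes in cells.items():
    for value in outcomes:
        X.append([1.0, float(d2), float(dT), float(d2 * dT)])
        y.append(value)
b0_hat, d0_hat, b1_hat, d1_hat = ols(X, y)

avg = {cell: mean(v) for cell, v in cells.items()}
dd_11 = (avg[1, 1] - avg[1, 0]) - (avg[0, 1] - avg[0, 0])  # equation (13.11)
dd_12 = (avg[1, 1] - avg[0, 1]) - (avg[1, 0] - avg[0, 0])  # equation (13.12)
print(d1_hat, dd_11, dd_12)
```

The cells deliberately have different numbers of observations, to show that the result does not rely on a balanced design. The practical advantage of the regression form is that it also delivers a standard error, which this sketch does not compute.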

We can make a connection between the difference-in-differences framework and the potential outcomes
framework that we discussed in previous chapters; see, for example, Section 7-6. The parameter
$\delta_1$ can be given an interpretation as an average treatment effect (ATE), where the “treatment” is
being in group T in the second time period.

When explanatory variables are added to equation (13.10) (to control for the fact that the populations
sampled may differ systematically over the two periods), the OLS estimate of $\delta_1$ no longer has
the simple form of (13.11), but its interpretation is similar.


Effect of Worker Compensation Laws on Weeks out of Work 


Meyer, Viscusi, and Durbin (1995) (hereafter, MVD) studied the length of time (in weeks) that an injured 
worker receives workers’ compensation. On July 15, 1980, Kentucky raised the cap on weekly earnings 
that were covered by workers’ compensation. An increase in the cap has no effect on the benefit for low- 
income workers, but it makes it less costly for a high-income worker to stay on workers’ compensation. 
Therefore, the control group is low-income workers, and the treatment group is high-income workers; 
high-income workers are defined as those who were subject to the pre-policy change cap. Using ran- 
dom samples both before and after the policy change, MVD were able to test whether more generous 
workers’ compensation causes people to stay out of work longer (everything else fixed). They started 
with a difference-in-differences analysis, using log(durat) as the dependent variable. Let afchnge be the 
dummy variable for observations after the policy change and highearn the dummy variable for high 
earners. Using the data in INJURY, the estimated equation, with standard errors in parentheses, is 


$$\widehat{\log(durat)} = \underset{(0.031)}{1.126} + \underset{(.0447)}{.0077}\, afchnge + \underset{(.047)}{.256}\, highearn + \underset{(.069)}{.191}\, afchnge \cdot highearn$$ [13.13]
$$n = 5{,}626,\ R^2 = .021.$$


Therefore, $\hat{\delta}_1 = .191$ (t = 2.77), which implies that the average length of time on workers’ compensation
for high earners increased by about 19% due to the increased earnings cap. The coefficient on
afchnge is small and statistically insignificant: as is expected, the increase in the earnings cap has no 
effect on duration for low-income workers. 
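As a quick check on the significance statements, the t statistics can be recomputed from the coefficients and standard errors reported in (13.13). The sketch below also forms approximate 95% confidence intervals using the normal critical value 1.96, which is an assumption that seems reasonable given n = 5,626; the dictionary layout is purely illustrative.

```python
# (coefficient, standard error) pairs as reported in equation (13.13).
estimates = {
    "afchnge": (0.0077, 0.0447),
    "afchnge*highearn": (0.191, 0.069),
}

CRIT = 1.96  # normal critical value for a two-sided 95% interval

results = {}
for name, (coef, se) in estimates.items():
    t_stat = coef / se
    interval = (coef - CRIT * se, coef + CRIT * se)
    results[name] = (t_stat, interval)
    print(f"{name}: t = {t_stat:.2f}, 95% CI = [{interval[0]:.3f}, {interval[1]:.3f}]")
```

The interval for the interaction term excludes zero, while the interval for afchnge comfortably includes it, matching the discussion above.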

This is a good example of how we can get a fairly precise estimate of the effect of a policy change 
even though we cannot explain much of the variation in the dependent variable. The dummy variables 
in (13.13) explain only 2.1% of the variation in log(durat). This makes sense: there are clearly many 
factors, including severity of the injury, that affect how long someone receives workers’ compensation.
Fortunately, we have a very large sample size, and this allows us to get a significant t statistic.

MVD also added a variety of controls for gender, marital status, age, industry, and type of injury. 
This allows for the fact that the kinds of people and types of injuries may differ systematically by 
earnings group across the two years. Controlling for these factors turns out to have little effect on the 
estimate of $\delta_1$. (See Computer Exercise C4.)
GOING FURTHER 13.2

What do you make of the coefficient and t statistic on highearn in equation (13.13)?

Sometimes, the two groups consist of people living in two neighboring states in the United States.
For example, to assess the impact of changing cigarette taxes on cigarette consumption, we can obtain
random samples from two states for two years. In State A, the control group, there was no change in
the cigarette tax. In State B, the treatment group, the tax increased (or decreased) between the two years.
The outcome variable would be a measure of cigarette consumption, and equation (13.10) can be
estimated to determine the effect of the tax on cigarette consumption.

For an interesting survey on natural experiment methodology and several additional examples, 
see Meyer (1995). 


13-2a Adding an Additional Control Group 


One of the shortcomings of the traditional two-group, two-period difference-in-differences setup is
that it assumes the outcome, y, would trend at the same rate in the treatment and control groups in the
absence of the intervention. (This could be a positive or negative trend.) For example, suppose we are studying the
effects of expanded health care for low-income families in a particular state. As in the Meyer, Viscusi, 
and Durbin (1995) application, we might use as a control group middle-income families that are not 
impacted by the policy change. In using the basic DD setup, we would have to assume that average 
health trends would be the same for the low-income and middle-income families in the absence of 
the intervention. This assumption is often known as the parallel trends assumption. Violation of 
the parallel trends assumption is a threat to the identification strategy used by DD, as can be seen by 
studying the expression for $\hat{\delta}_{DD}$ given in equation (13.12): the DD estimate is simply the difference in
the estimated trends for the treatment and control groups. 

One way to allow more flexibility is to collect information on a different control group. For 
example, suppose that in another state there was no intervention. If we think any differences in trends 
in health outcomes between low- and middle-income families are similar across states, we can include
the state without the intervention as a control to obtain a more convincing estimate. 

To be more concrete, let L denote low-income families and M middle-income families. Let B 
denote the state where the intervention occurred and A the control state. Let dL be a dummy variable
indicating low-income families, dB a dummy variable indicating state B, and d2 a dummy variable for 
the second time period. Now we estimate the equation 


$$y = \beta_0 + \beta_1\, dL + \beta_2\, dB + \beta_3\, dL \cdot dB + \delta_0\, d2 + \delta_1\, d2 \cdot dL + \delta_2\, d2 \cdot dB + \delta_3\, d2 \cdot dL \cdot dB + u,$$ [13.14]


where y is some measure of health outcomes. Equation (13.14) includes each dummy variable separately,
the three pairwise interactions, and the triple interaction term, d2·dL·dB. This last term is the
treatment indicator; the other terms act as controls that allow differences across time, income group,
and state. Note that the trend for low-income people is allowed to be different from middle-income
people through the term d2·dL, but in order to interpret $\delta_3$ as the policy effect, we assume that any
difference in trends between the L and M groups is the same across states in the absence of the intervention.

The easiest way to interpret the above equation is to study the OLS estimator of $\delta_3$. After some
tedious algebra, it can be shown that


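$$\hat{\delta}_3 = [(\bar{y}_{2,L,B} - \bar{y}_{1,L,B}) - (\bar{y}_{2,M,B} - \bar{y}_{1,M,B})] - [(\bar{y}_{2,L,A} - \bar{y}_{1,L,A}) - (\bar{y}_{2,M,A} - \bar{y}_{1,M,A})]$$ [13.15]
$$= \hat{\delta}_{DD,B} - \hat{\delta}_{DD,A} = \hat{\delta}_{DDD}.$$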


The first term in brackets is the usual DD estimator using only the state that imposed the new policy. 
It uses as a control group middle-income families from the same state. The second term is the DD 
estimator in the state not imposing the new policy. If health trends between the L and M groups do not
differ in state A, and there was no other intervention that would affect health outcomes, then $\hat{\delta}_{DD,A}$
should be roughly zero. In general, we estimate the policy effect by subtracting $\hat{\delta}_{DD,A}$ from $\hat{\delta}_{DD,B}$ as a
way of accounting for possibly different trends in the L and M groups. If the differing trends in L and
M also differ by state, then even (13.15) will not produce a consistent estimator of the policy effect.

The estimator $\hat{\delta}_3$ is usually called the difference-in-difference-in-differences (DDD) estimator,
and can be denoted $\hat{\delta}_{DDD}$. We obtain two DD estimators and then difference those. In obtaining the
DDD estimator, it is convenient to use OLS applied to equation (13.14) because heteroskedasticity-robust
standard errors are easy to obtain. Plus, as in the DD case, we can include control variables
$x_1, \ldots, x_k$ either to account for compositional effects or to reduce the error variance in order to
improve precision of $\hat{\delta}_{DDD}$.
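The structure of the DDD estimator in (13.15) is easy to see with numbers. The following sketch uses invented cell averages for the eight (period, income group, state) cells; only the arithmetic reflects the formula, and the values and the helper `dd` are hypothetical.

```python
# Hypothetical cell averages of the outcome, keyed by (period, income group, state).
ybar = {
    (1, "L", "B"): 50.0, (2, "L", "B"): 58.0,
    (1, "M", "B"): 60.0, (2, "M", "B"): 63.0,
    (1, "L", "A"): 48.0, (2, "L", "A"): 52.0,
    (1, "M", "A"): 61.0, (2, "M", "A"): 64.0,
}


def dd(state: str) -> float:
    """Within-state DD: change over time for group L minus change for group M."""
    change_l = ybar[2, "L", state] - ybar[1, "L", state]
    change_m = ybar[2, "M", state] - ybar[1, "M", state]
    return change_l - change_m


dd_b = dd("B")      # state where the intervention occurred
dd_a = dd("A")      # control state
ddd = dd_b - dd_a   # equation (13.15)
print(dd_b, dd_a, ddd)
```

In this made-up example the L group trends upward faster than the M group even in the control state, so the within-state DD for state B (5.0) overstates the effect; subtracting the control-state DD (1.0) leaves 4.0.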


13-2b A General Framework for Policy Analysis 
with Pooled Cross Sections 


Another way to expand the basic DD methodology is to obtain multiple control and treatment groups 
as well as more than two time periods. We can create a very general framework for policy analysis by 
allowing a general pattern of interventions, where some units are never “treated” and others may be 
treated in different time periods. It is even possible that early in the study some units are subject to a 
policy but then later on the policy is dropped. As a word of warning, with general patterns of interven- 
tion it is a mistake to try to fit the problem in the basic DD or even DDD frameworks. 

It is helpful to introduce an i subscript to represent an individual unit, which could be a person,
a family, a firm, a school, and so on. Each i belongs to a pair (g, t), where g is a group and t denotes a
time period. Often the groups are based on geography, such as a city, county, state, or province, but 
we have already seen examples where the groups can be something else (low earners and high earn- 
ers, for example). Most commonly, t represents a year, but it could be much shorter than a year or we 
could have time periods spread out more than a year apart. 

In the general setting, we are interested in a policy intervention that applies at the group level. In order
to be convincing, there should be a before-after period for at least some of the groups. Other groups may
be control groups in that the policy is never implemented. In the simplest case, the policy is indicated by
a dummy variable, say $x_{gt}$, which is one if group g in year t is subject to the policy intervention, and zero
otherwise. It is very important to properly code this variable before undertaking any analysis. With many
groups and time periods, the pattern of zeros and ones for $x_{gt}$ can be complex. The complexity, which can
result from policy interventions being staggered across groups, and policies being rescinded, adds power
to our ability to estimate the effects of policy changes. It is important to not think that $x_{gt}$ can always be
constructed by interacting dummy variables indicating groups and time periods, as in the basic DD setup.

Given the policy variable $x_{gt}$, we can now write down an equation that can be used to estimate the
policy effect. A flexible (but not completely flexible) model is


$$y_{igt} = \lambda_t + \alpha_g + \beta x_{gt} + z_{igt}\gamma + u_{igt}, \quad i = 1, \ldots, N_{gt};\ g = 1, \ldots, G;\ t = 1, \ldots, T,$$ [13.16]


where the notation shows that the group/time cell (g, t) has $N_{gt}$ observations. The variable $y_{igt}$ is
measured at the unit level, as are the explanatory variables $z_{igt}$. Recall that $z_{igt}\gamma$ is shorthand for several
explanatory variables, each multiplied by its own coefficient.
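The earlier warning about coding the policy indicator carefully can be made concrete. The sketch below builds $x_{gt}$ from a hypothetical record of policy spells for three groups, including one group that adopts the policy and later rescinds it. The group labels, the spell format, and the function name are all assumptions made for illustration.

```python
# Hypothetical policy histories: for each group, the spells during which the
# policy is in force, given as inclusive (first_period, last_period) pairs.
spells = {
    "A": [],          # never treated
    "B": [(3, 5)],    # adopts in period 3 and keeps the policy
    "C": [(2, 3)],    # adopts in period 2, rescinds after period 3
}
periods = range(1, 6)  # t = 1, ..., 5


def policy_indicator(group: str, t: int) -> int:
    """Return x_gt: 1 if the group is subject to the policy in period t, else 0."""
    return int(any(start <= t <= end for start, end in spells[group]))


x = {(g, t): policy_indicator(g, t) for g in spells for t in periods}
for g in spells:
    print(g, [x[g, t] for t in periods])
```

The pattern for group C switches on and then off again, so it cannot be written as a group dummy times a single "post" dummy, which is why the indicator is coded directly from the policy history.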

The parameters $\lambda_t$ are the aggregate time effects that capture external factors. For example, if g
indexes states, the $\lambda_t$ can be country-wide factors that affect all states equally. In most policy studies
it is very important to account for such factors, as policy implementation tends to be bunched in different
time periods. If we do not include the $\lambda_t$, then we may spuriously conclude that a policy had an
impact. Or, we may estimate little or no policy effect where there is one. The group effects $\alpha_g$ (for
example, a state effect if g indexes states) account for systematic differences in groups that are constant
across time. Policy implementation tends to depend on group characteristics that we may not
be able to fully measure, and these same factors may influence $y_{igt}$. This is true whether g represents
entities such as counties and states or whether they are, say, different age groups or income groups.
In estimating (13.16), we account for the time and group effects simply by including dummy
variables. In other words, we define dummy variables, say dt, for each time period, and group
dummy variables, dg, for each group. In practice, one includes an intercept and excludes one group
and one time period (usually the first, but any one will do). Then we estimate (13.16) using pooled
OLS, where the pooling is across all individuals across all (g, t) pairs. The coefficient of interest
is $\beta$. The variables $z_{igt}$ can include measured variables that change only at the (g, t) level but also,
as the i subscript indicates, individual-specific covariates. When we take the policy assignment as
fixed and view estimation uncertainty through the sampling error, proper inference is obtained using
heteroskedasticity-robust standard errors in the pooled OLS estimation.
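The dummy-variable recipe just described can be sketched on a tiny artificial data set. Everything below is hypothetical: three groups, three periods, staggered adoption, one individual-level covariate z, and no error term, so that pooled OLS should recover the assumed $\beta$ = 2 exactly. The OLS routine is a small hand-written solver, and robust standard errors are not computed.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def ols(X, y):
    """OLS coefficients from the normal equations (X'X) b = X'y."""
    k = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)


groups, periods = [0, 1, 2], [1, 2, 3]
adopt = {0: None, 1: 3, 2: 2}        # hypothetical first treated period per group
lam = {1: 0.0, 2: 1.0, 3: 1.5}       # aggregate time effects
alpha = {0: 5.0, 1: 7.0, 2: 4.0}     # group effects
beta, gamma = 2.0, 0.5               # assumed policy effect and covariate effect
z_values = [0.0, 1.0]                # covariate values of the two units in each cell

X, y = [], []
for g in groups:
    for t in periods:
        x_gt = float(adopt[g] is not None and t >= adopt[g])
        for z in z_values:
            time_dummies = [float(t == s) for s in periods[1:]]    # first period omitted
            group_dummies = [float(g == h) for h in groups[1:]]    # first group omitted
            X.append([1.0] + time_dummies + group_dummies + [x_gt, z])
            y.append(lam[t] + alpha[g] + beta * x_gt + gamma * z)

coefs = ols(X, y)
beta_hat, gamma_hat = coefs[-2], coefs[-1]
print(round(beta_hat, 6), round(gamma_hat, 6))
```

Estimation works here because the staggered adoption pattern makes $x_{gt}$ vary in a way that the additive group and time dummies cannot reproduce.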

The setup in equation (13.16) can be applied to important problems such as studying the labor
market impacts of minimum wages. In the United States, minimum wages can vary at the city level, in
which case g indexes city, although it is most common to study state-level variation. The individual
outcomes $y_{igt}$ can be hourly wage (probably its log) or employment status. It could be very important to
account for both time and city (or state) effects. In addition, we might have information on education,
workforce experience, and background variables for individuals; these controls would be included in $z_{igt}$.

Equation (13.16) imposes its own version of a parallel trend assumption because the effects $\lambda_t$
have the same impact for all groups g. In particular, drop the variables $z_{igt}$ and set $x_{gt}$ to zero for all
(g, t). Then, for a given group g, the mean value of $y_{igt}$ over time is simply traced out by $\lambda_t$. One way to relax
that assumption is to use group-specific linear time trends, at least if we have $T \geq 3$ time periods. In
particular, we replace (13.16) with


$$y_{igt} = \lambda_t + \alpha_g + \psi_g t + \beta x_{gt} + z_{igt}\gamma + u_{igt},$$ [13.17]


where $\psi_g$ captures the linear trend for group g. Notice that we still want the aggregate effects $\lambda_t$ included:
the term $\psi_g t$ imposes a linear trend on each group, and the $\lambda_t$ account for nonlinear
aggregate time effects. In estimation, we will lose another $\lambda_t$ because we have partly accounted for
aggregate time effects with the group-specific trends.
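One way to see why an additional time effect is lost is to write out the collinearity explicitly. The following is a sketch of the argument, using indicator notation $1[\cdot]$ for the group and time dummies, which is a notational assumption not used elsewhere in the text.

```latex
% Summing the G group-specific trend regressors gives the common trend t,
% and t is itself a linear combination of the period dummies:
\sum_{g=1}^{G} 1[\text{group} = g]\cdot t \;=\; t \;=\; \sum_{s=1}^{T} s \cdot 1[t = s].
```

Because the intercept and the time dummies together already span any function of t, including t itself, a full set of group-specific trends adds exactly one linear dependency, and one more time dummy must be dropped.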

In the minimum wage example, it is easy to imagine that minimum wages and labor market out- 
comes have trends that differ by city (or state). 

Why should we use only linear group-specific trends? In fact, with lots of time periods we can 
include more complicated trends, such as a group-specific quadratic. But the more terms we include, 
the more variation in the policy indicator $x_{gt}$ we require in order to pin down any effects of the policy.
In the extreme case, one might think of including separate dummy variables for all (g, t) pairs, repre- 
sented by, say, 

$$y_{igt} = \theta_{gt} + \beta x_{gt} + z_{igt}\gamma + u_{igt},$$ [13.18]

where $\theta_{gt}$ is a different intercept for each (g, t) pair. Such a formulation is more general than any of
the previous equations, including (13.17). Unfortunately, it is also useless for estimating $\beta$ because $x_{gt}$
varies only at the (g, t) level, and is, therefore, perfectly collinear with the intercepts.
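The collinearity can be written out in one line. In the sketch below, $1[\cdot]$ denotes an indicator function and $\tilde{\theta}_{gt}$ is a relabeled cell intercept; both are notational assumptions introduced only for this illustration.

```latex
% x_{gt} is constant within each (g,t) cell, so it is a linear combination
% of the cell dummies and is absorbed into the cell intercepts:
x_{gt} = \sum_{h=1}^{G}\sum_{s=1}^{T} x_{hs}\, 1[g = h,\ t = s]
\quad\Longrightarrow\quad
\theta_{gt} + \beta x_{gt} = \tilde{\theta}_{gt}.
```

Any value of $\beta$ can be offset by a corresponding change in the cell intercepts, so the data cannot distinguish the policy effect from the unrestricted (g, t) effects.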

If interest lies in elements of $\gamma$ (for example, the policy of interest applies to different units
within at least some (g, t) pairs), then (13.18) becomes attractive. Operationally, rather than just
including separate dummies for time and separate dummies for groups, as in (13.16), one includes a
full set of time-group interactions: dt·dg for all t = 1, ..., T and g = 1, ..., G. This allows each
group to have its own very flexible time trend.

Extensions to more than one policy variable are straightforward, and the policy variables need
not be binary indicators. For example, the vector $x_{gt}$ can include, say, the state-level minimum wage
for state g in year t, along with a dummy variable equal to unity if the state is a right-to-work state.
Or, maybe we have individual-level health outcomes, $y_{igt}$, and $x_{gt}$ is a vector (collection) of state-level
health policy variables, which could be continuous or discrete. Then (13.16) becomes


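$$y_{igt} = \lambda_t + \alpha_g + x_{gt}\beta + z_{igt}\gamma + u_{igt}$$ [13.19]
$$= \lambda_t + \alpha_g + \beta_1 x_{gt1} + \cdots + \beta_K x_{gtK} + \gamma_1 z_{igt1} + \cdots + \gamma_J z_{igtJ} + u_{igt}.$$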
The $\beta_j$ measure the (ceteris paribus) effect of different policies. If we think the policy takes time to
have its full impact, we can even include a lagged policy indicator. For example, if $p_{gt}$ is the policy
indicator in time period t, we might estimate an equation such as


$$y_{igt} = \lambda_t + \alpha_g + \beta_1 p_{gt} + \beta_2 p_{g,t-1} + \beta_3 p_{g,t-2} + \gamma_1 z_{igt1} + \cdots + \gamma_J z_{igtJ} + u_{igt}.$$


Naturally, including lags requires more time periods. Generally, equation (13.17) gets modified in a 
similar way. 
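The remark that lags require more time periods is easy to see when the lagged indicators are actually constructed. The sketch below builds the current value and two lags of a hypothetical policy indicator for one group; the group label, the series, and the function name are illustrative assumptions.

```python
# Hypothetical policy indicator p_gt for one group over periods t = 1, ..., 6.
p = {"B": [0, 0, 1, 1, 1, 1]}


def with_lags(series, n_lags):
    """Return rows (t, p_t, p_{t-1}, ..., p_{t-n_lags}) for the periods in which
    every lag is observed; the first n_lags periods are dropped."""
    rows = []
    for idx in range(n_lags, len(series)):
        t = idx + 1  # periods are numbered from 1
        rows.append((t, *[series[idx - k] for k in range(n_lags + 1)]))
    return rows


rows = with_lags(p["B"], n_lags=2)
for row in rows:
    print(row)
```

With two lags, the first two periods cannot be used, so six periods of data yield only four usable periods per group.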


13-3 Two-Period Panel Data Analysis 


We now turn to the analysis of the simplest kind of panel data: for a cross section of individuals, 
schools, firms, cities, or whatever, we have two years of data; call these t = 1 and t = 2. These years 
need not be adjacent, but t = 1 corresponds to the earlier year. For example, the file CRIME2 contains
data on (among other things) crime and unemployment rates for 46 cities for 1982 and 1987.
Therefore, t = 1 corresponds to 1982, and t = 2 corresponds to 1987. 

What happens if we use the 1987 cross section and run a simple regression of crmrte on unem? 
We obtain 


$$\widehat{crmrte} = \underset{(20.76)}{128.38} - \underset{(3.42)}{4.16}\, unem$$
$$n = 46,\ R^2 = .033.$$


If we interpret the estimated equation causally, it implies that an increase in the unemployment rate 
lowers the crime rate. This is certainly not what we expect. The coefficient on unem is not statistically 
significant at standard significance levels: at best, we have found no link between crime and unem- 
ployment rates. 

As we have emphasized throughout this text, this simple regression equation likely suffers from 
omitted variable problems. One possible solution is to try to control for more factors, such as age 
distribution, gender distribution, education levels, law enforcement efforts, and so on, in a multiple 
regression analysis. But many factors might be hard to control for. In Chapter 9, we showed how 
including the crmrte from a previous year—in this case, 1982—can help to control for the fact that 
different cities have historically different crime rates. This is one way to use two years of data for 
estimating a causal effect. 

An alternative way to use panel data is to view the unobserved factors affecting the dependent 
variable as consisting of two types: those that are constant and those that vary over time. Letting i
denote the cross-sectional unit and t the time period, we can write a model with a single observed
explanatory variable as 


$$y_{it} = \beta_0 + \delta_0\, d2_t + \beta_1 x_{it} + a_i + u_{it}, \quad t = 1, 2.$$ [13.20]


In the notation $y_{it}$, i denotes the person, firm, city, and so on, and t denotes the time period. The variable
$d2_t$ is a dummy variable that equals zero when t = 1 and one when t = 2; it does not change
across i, which is why it has no i subscript. Therefore, the intercept for t = 1 is $\beta_0$, and the intercept
for t = 2 is $\beta_0 + \delta_0$. Just as in using independently pooled cross sections, allowing the intercept
to change over time is important in most applications. In the crime example, secular trends in the
United States will cause crime rates in all U.S. cities to change, perhaps markedly, over a five-year
period.

The variable $a_i$ captures all unobserved, time-constant factors that affect $y_{it}$. (The fact that $a_i$
has no t subscript tells us that it does not change over time.) Generically, $a_i$ is called an unobserved
effect. It is also common in applied work to find $a_i$ referred to as a fixed effect, which helps us to
remember that $a_i$ is fixed over time. The model in (13.20) is called an unobserved effects model or
a fixed effects model. In applications, you might see $a_i$ referred to as unobserved heterogeneity as
well (or individual heterogeneity, firm heterogeneity, city heterogeneity, and so on).

The error $u_{it}$ is often called the idiosyncratic error or time-varying error, because it represents
unobserved factors that change over time and affect $y_{it}$. These are very much like the errors in a
straight time series regression equation.

A simple unobserved effects model for city crime rates for 1982 and 1987 is 


$$crmrte_{it} = \beta_0 + \delta_0\, d87_t + \beta_1\, unem_{it} + a_i + u_{it},$$ [13.21]


where d87 is a dummy variable for 1987. Because i denotes different cities, we call $a_i$ an unobserved
city effect or a city fixed effect: it represents all factors affecting city crime rates that do not change
over time. Geographical features, such as the city’s location in the United States, are included in $a_i$.
Many other factors may not be exactly constant, but they might be roughly constant over a five-year
period. These might include certain demographic features of the population (age, race, and education).
Different cities may have their own methods for reporting crimes, and the people living in the
cities might have different attitudes toward crime; these are typically slow to change. For historical
reasons, cities can have very different crime rates, and historical factors are effectively captured by
the unobserved effect $a_i$.

How should we estimate the parameter of interest, $\beta_1$, given two years of panel data? One possibility
is just to pool the two years and use OLS, essentially as in Section 13-1. This method has two
drawbacks. The most important of these is that, in order for pooled OLS to produce a consistent estimator
of $\beta_1$, we would have to assume that the unobserved effect, $a_i$, is uncorrelated with $x_{it}$. We can
easily see this by writing (13.20) as


$$y_{it} = \beta_0 + \delta_0\, d2_t + \beta_1 x_{it} + v_{it}, \quad t = 1, 2,$$ [13.22]


where $v_{it} = a_i + u_{it}$ is often called the composite error. From what we know about OLS, we must
assume that $v_{it}$ is uncorrelated with $x_{it}$, where t = 1 or 2, for OLS to estimate $\beta_1$ (and the other
parameters) consistently. This is true whether we use a single cross section or pool the two cross sections.
Therefore, even if we assume that the idiosyncratic error $u_{it}$ is uncorrelated with $x_{it}$, pooled OLS is biased
and inconsistent if $a_i$ and $x_{it}$ are correlated. The resulting bias in pooled OLS is sometimes called
heterogeneity bias, but it is really just bias caused from omitting a time-constant variable.

GOING FURTHER 13.3

Suppose that $a_i$, $u_{i1}$, and $u_{i2}$ have zero means and are pairwise uncorrelated. Show
that $Cov(v_{i1}, v_{i2}) = Var(a_i)$, so that the composite errors are positively serially correlated
across time, unless $a_i = 0$. What does this imply about the usual OLS standard errors
from pooled OLS estimation?

To illustrate what happens, we use the data in CRIME2 to estimate (13.21) by pooled OLS.
Because there are 46 cities and two years for each city, there are 92 total observations:


$$\widehat{crmrte} = \underset{(12.74)}{93.42} + \underset{(7.98)}{7.94}\, d87 + \underset{(1.188)}{.427}\, unem$$ [13.23]
$$n = 92,\ R^2 = .012.$$


(When reporting the estimated equation, we usually drop the i and t subscripts.) The coefficient on
unem, though positive in (13.23), has a very small t statistic. Thus, using pooled OLS on the two
years has not substantially changed anything from using a single cross section. This is not surprising 
because using pooled OLS does not solve the omitted variables problem. (The standard errors in this 
equation are incorrect because of the serial correlation described in Going Further 13.3, but we ignore 
this as pooled OLS is not the focus here.) 

In most applications, the main reason for collecting panel data is to allow for the unobserved effect,
$a_i$, to be correlated with the explanatory variables. For example, in the crime equation, we want to allow
the unmeasured city factors in $a_i$ that affect the crime rate also to be correlated with the unemployment
rate. It turns out that this is simple to allow: because $a_i$ is constant over time, we can difference the data
across the two years. More precisely, for a cross-sectional observation i, write the two years as


$$y_{i2} = (\beta_0 + \delta_0) + \beta_1 x_{i2} + a_i + u_{i2} \quad (t = 2)$$
$$y_{i1} = \beta_0 + \beta_1 x_{i1} + a_i + u_{i1} \quad (t = 1).$$


If we subtract the second equation from the first, we obtain 


$$(y_{i2} - y_{i1}) = \delta_0 + \beta_1 (x_{i2} - x_{i1}) + (u_{i2} - u_{i1}),$$

or

$$\Delta y_i = \delta_0 + \beta_1 \Delta x_i + \Delta u_i, \tag{13.24}$$


where $\Delta$ denotes the change from t = 1 to t = 2. The unobserved effect, $a_i$, does not appear
in (13.24): it has been “differenced away.” Also, the intercept in (13.24) is actually the change in
the intercept from t = 1 to t = 2 as given in equation (13.22).

Equation (13.24), which we call the first-differenced equation, is very simple. It is just a single 
cross-sectional equation, but each variable is differenced over time. We can analyze (13.24) using the 
methods we developed in Part 1, provided the key assumptions are satisfied. The most important of 
these is that $\Delta u_i$ is uncorrelated with $\Delta x_i$. This assumption holds if the idiosyncratic
error at each time t, $u_{it}$, is uncorrelated with the explanatory variable in both time periods. This
is another version of the strict exogeneity assumption that we encountered in Chapter 10 for time series
models. In particular, this assumption rules out the case in which $x_{it}$ is the lagged dependent
variable, $y_{i,t-1}$. Unlike in Chapter 10, we allow $x_{it}$ to be correlated with unobservables that
are constant over time. When we obtain the OLS estimator of $\beta_1$ from (13.24), we call the resulting
estimator the first-differenced estimator.
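
To see the contrast between pooled OLS and the first-differenced estimator at work, the following is a
minimal simulation sketch. Everything in it is an assumption made for illustration, not part of the
CRIME2 application: the data are artificial, the true slope is set to 2, and $x_{it}$ is built to be
correlated with $a_i$ while the idiosyncratic errors are strictly exogenous. Pooled OLS with a year
dummy is computed by demeaning within each year, which gives the same slope by the partialling-out
result from Chapter 3.

```python
import random


def ols_slope(x, y):
    """Slope from a simple regression of y on x (intercept included)."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return sxy / sxx


def demean(values):
    """Subtract the sample mean from every element."""
    m = sum(values) / len(values)
    return [v - m for v in values]


def simulate(n_units=5000, beta1=2.0, delta0=0.5, seed=0):
    """Return (pooled OLS slope, first-differenced slope) for artificial data."""
    rng = random.Random(seed)
    x1, x2, y1, y2 = [], [], [], []
    for _ in range(n_units):
        a = rng.gauss(0, 1)          # unobserved effect a_i
        xi1 = a + rng.gauss(0, 1)    # x_it is correlated with a_i by construction
        xi2 = a + rng.gauss(0, 1)
        x1.append(xi1)
        x2.append(xi2)
        y1.append(1.0 + beta1 * xi1 + a + rng.gauss(0, 1))
        y2.append(1.0 + delta0 + beta1 * xi2 + a + rng.gauss(0, 1))
    # Pooled OLS of y on an intercept, a year dummy, and x:
    # demeaning within each year partials out the intercept and the dummy.
    pooled = ols_slope(demean(x1) + demean(x2), demean(y1) + demean(y2))
    # First differencing removes a_i before running OLS.
    dx = [b - a_ for a_, b in zip(x1, x2)]
    dy = [b - a_ for a_, b in zip(y1, y2)]
    fd = ols_slope(dx, dy)
    return pooled, fd


pooled_hat, fd_hat = simulate()
print("pooled OLS slope:", round(pooled_hat, 3))
print("first-differenced slope:", round(fd_hat, 3))
```

In this design the pooled OLS slope settles near 2.5 rather than 2, because the omitted $a_i$ is
positively correlated with $x_{it}$, while the first-differenced slope stays close to the true value.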

In the crime example, assuming that $\Delta u_i$ and $\Delta unem_i$ are uncorrelated may be reasonable,
but it can also fail. For example, suppose that law enforcement effort (which is in the idiosyncratic
error) increases more in cities where the unemployment rate decreases. This can cause negative correlation
between $\Delta u_i$ and $\Delta unem_i$, which would then lead to bias in the OLS estimator. Naturally,
this problem
can be overcome to some extent by including more factors in the equation, something we will cover 
later. As usual, it is always possible that we have not accounted for enough time-varying factors. 

Another crucial condition is that $\Delta x_i$ must have some variation across i. This qualification
fails if the explanatory variable does not change over time for any cross-sectional observation, or if it
changes by the same amount for every observation. This is not an issue in the crime rate example
because the unemployment rate changes across time for almost all cities. But, if i denotes an individual
and $x_{it}$ is a dummy variable for gender, $\Delta x_i = 0$ for all i; we clearly cannot estimate
(13.24) by OLS in this case. This actually makes perfectly good sense: because we allow $a_i$ to be
correlated with $x_{it}$, we cannot hope to separate the effect of $a_i$ on $y_{it}$ from the effect of
any variable that does not change over time.
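
The requirement can be read directly from the usual simple regression slope formula applied to (13.24).
As a short sketch, writing $\overline{\Delta x}$ and $\overline{\Delta y}$ for the sample averages of the
changes (notation introduced here only for this display):

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (\Delta x_i - \overline{\Delta x})(\Delta y_i - \overline{\Delta y})}{\sum_{i=1}^{n} (\Delta x_i - \overline{\Delta x})^2}.$$

If $\Delta x_i$ equals the same constant for every i (zero for a time-constant variable such as gender),
every term in the denominator is zero and the slope is not defined.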

The only other assumption we need to apply to the usual OLS statistics is that (13.24) satisfies 
the homoskedasticity assumption. This is reasonable in many cases, and, if it does not hold, we know 
how to test and correct for heteroskedasticity using the methods in Chapter 8. It is sometimes fair 
to assume that (13.24) fulfills all of the classical linear model assumptions. The OLS estimators are 
unbiased and all statistical inference is exact in such cases. 

When we estimate (13.24) for the crime rate example, we get 


$$\widehat{\Delta crmrte} = \underset{(4.70)}{15.40} + \underset{(.88)}{2.22}\,\Delta unem \tag{13.25}$$
$$n = 46,\quad R^2 = .127,$$


which now gives a positive, statistically significant relationship between the crime and unemployment
rates. Thus, differencing to eliminate time-constant effects makes a big difference in this example. The
intercept in (13.25) also reveals something interesting. Even if $\Delta unem = 0$, we predict an increase
in the crime rate (crimes per 1,000 people) of 15.40. This reflects a secular increase in crime rates
throughout the United States from 1982 to 1987.

Even if we do not begin with the unobserved effects model (13.20), using differences across time 
makes intuitive sense. Rather than estimating a standard cross-sectional relationship—which may suf- 
fer from omitted variables, thereby making ceteris paribus conclusions difficult—equation (13.24) 
explicitly considers how changes in the explanatory variable over time affect the change in y over the 
same time period. Nevertheless, it is still very useful to have (13.20) in mind: it explicitly shows that 
we can estimate the effect of $x_{it}$ on $y_{it}$, holding $a_i$ fixed.

Although differencing two years of panel data is a powerful way to control for unobserved 
effects, it is not without cost. First, panel data sets are harder to collect than a single cross section, 
especially for individuals. We must use a survey and keep track of the individual for a follow-up sur- 
vey. It is often difficult to locate some people for a second survey. For units such as firms, some will 
go bankrupt or merge with other firms. Panel data are much easier to obtain for schools, cities, coun- 
ties, states, and countries. 

Even if we have collected a panel data set, the differencing used to eliminate $a_i$ can greatly reduce
the variation in the explanatory variables. While $x_{it}$ frequently has substantial variation in the
cross section for each t, $\Delta x_i$ may not have much variation. We know from Chapter 3 that little
variation in $\Delta x_i$ can lead to a large standard error for $\hat{\beta}_1$ when estimating (13.24)
by OLS. We can combat this by using a large cross section, but this is not always possible. Also, using
longer differences over time is sometimes better than using year-to-year changes.
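
The role of variation in $\Delta x_i$ can be made explicit. As a sketch, assume (13.24) satisfies the
homoskedasticity assumption and let $\sigma^2_{\Delta u}$ denote $\mathrm{Var}(\Delta u_i)$, a symbol
introduced here only for illustration. Then, conditional on the sample values of $\Delta x_i$, the usual
simple regression variance formula gives

$$\mathrm{Var}(\hat{\beta}_1) = \frac{\sigma^2_{\Delta u}}{\sum_{i=1}^{n} (\Delta x_i - \overline{\Delta x})^2}.$$

The denominator is the total sample variation in the changes. It grows with the number of cross-sectional
units and with the spread of $\Delta x_i$, which is why a larger cross section or longer differences can
help.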

As an example, consider the problem of estimating the return to education, now using panel data 
on individuals for two years. The model for person i is 


$$\log(wage_{it}) = \beta_0 + \delta_0 d2_t + \beta_1 educ_{it} + a_i + u_{it}, \quad t = 1, 2,$$


where $a_i$ contains unobserved ability, which is probably correlated with $educ_{it}$. Again, we allow
different intercepts across time to account for aggregate productivity gains (and inflation, if
$wage_{it}$ is in nominal terms). Because, by definition, innate ability does not change over time, panel
data methods seem ideally suited to estimate the return to education. The equation in first differences is


$$\Delta\log(wage_i) = \delta_0 + \beta_1 \Delta educ_i + \Delta u_i, \tag{13.26}$$


and we can estimate this by OLS. The problem is that we are interested in working adults, and for most 
employed individuals, education does not change over time. If only a small fraction of our sample has 
$\Delta educ_i$ different from zero, it will be difficult to get a precise estimator of $\beta_1$ from (13.26), unless we
have a rather large sample size. In theory, using a first-differenced equation to estimate the return to 
education is a good idea, but it does not work very well with most currently available panel data sets. 

Adding several explanatory variables causes no difficulties. We begin with the unobserved effects 
model 


$$y_{it} = \beta_0 + \delta_0 d2_t + \beta_1 x_{it1} + \beta_2 x_{it2} + \cdots + \beta_k x_{itk} + a_i + u_{it} \tag{13.27}$$


for t = 1 and 2. This equation looks more complicated than it is because each explanatory variable 
has three subscripts. The first denotes the cross-sectional observation number, the second denotes the 
time period, and the third is just a variable label. 


EXAMPLE 13.5 Sleeping versus Working


We use the two years of panel data in SLP75_81, from Biddle and Hamermesh (1990), to estimate the
tradeoff between sleeping and working. In Problem 3 in Chapter 3, we used just the 1975 cross section.
The panel data set for 1975 and 1981 has 239 people, which is much smaller than the 1975 cross
section that includes over 700 people. An unobserved effects model for total minutes of sleeping per
week is


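$$
\begin{aligned}
slpnap_{it} ={}& \beta_0 + \delta_0 d81_t + \beta_1 totwrk_{it} + \beta_2 educ_{it} + \beta_3 marr_{it} \\
& + \beta_4 yngkid_{it} + \beta_5 gdhlth_{it} + a_i + u_{it}, \quad t = 1, 2.
\end{aligned}
$$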


The unobserved effect, $a_i$, would be called an unobserved individual effect or an individual fixed
effect. It is potentially important to allow $a_i$ to be correlated with $totwrk_{it}$: the same factors
(some biological) that cause people to sleep more or less (captured in $a_i$) are likely correlated with
the amount of time spent working. Some people just have more energy, and this causes them to sleep less
and work more. The variable educ is years of education, marr is a marriage dummy variable, yngkid is a
dummy variable indicating the presence of a small child, and gdhlth is a “good health” dummy variable.
Notice that we do not include gender or race (as we did in the cross-sectional analysis), because
these do not change over time; they are part of $a_i$. Our primary interest is in $\beta_1$.
Differencing across the two years gives the estimable equation


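$$
\begin{aligned}
\Delta slpnap_i ={}& \delta_0 + \beta_1 \Delta totwrk_i + \beta_2 \Delta educ_i + \beta_3 \Delta marr_i \\
& + \beta_4 \Delta yngkid_i + \beta_5 \Delta gdhlth_i + \Delta u_i.
\end{aligned}
$$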


Assuming that the change in the idiosyncratic error, $\Delta u_i$, is uncorrelated with the changes in all
explanatory variables, we can get consistent estimators using OLS. This gives


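$$
\begin{aligned}
\widehat{\Delta slpnap} ={}& \underset{(45.87)}{-92.63} - \underset{(.036)}{.227}\,\Delta totwrk - \underset{(48.759)}{.024}\,\Delta educ \\
& + \underset{(92.86)}{104.21}\,\Delta marr + \underset{(87.65)}{94.67}\,\Delta yngkid + \underset{(76.60)}{87.58}\,\Delta gdhlth
\end{aligned}
\tag{13.28}
$$
$$n = 239,\quad R^2 = .150.$$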


The coefficient on Atotwrk indicates a tradeoff between sleeping and working: holding other factors 
fixed, one more hour of work is associated with .227(60) = 13.62 fewer minutes of sleeping. The t 
statistic (−6.31) is very significant. No other estimates, except the intercept, are statistically different
from zero. The F test for joint significance of all variables except Atotwrk gives p-value = .49, which 
means they are jointly insignificant at any reasonable significance level and could be dropped from 
the equation. 
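
The arithmetic behind these statements is easy to reproduce. The short sketch below only uses the
coefficients and standard errors reported in (13.28); the dictionary keys are labels chosen here for
illustration, and 1.96 is used as an approximate two-sided 5% critical value.

```python
# Coefficients and standard errors as reported in equation (13.28).
estimates = {
    "intercept": (-92.63, 45.87),
    "d_totwrk": (-0.227, 0.036),
    "d_educ": (-0.024, 48.759),
    "d_marr": (104.21, 92.86),
    "d_yngkid": (94.67, 87.65),
    "d_gdhlth": (87.58, 76.60),
}

# t statistic = coefficient / standard error.
t_stats = {name: coef / se for name, (coef, se) in estimates.items()}

# One more hour (60 minutes) of work, in minutes of sleep given up.
minutes_lost_per_hour = 60 * abs(estimates["d_totwrk"][0])

# Which estimates exceed the approximate 5% critical value in absolute value?
significant = sorted(name for name, t in t_stats.items() if abs(t) > 1.96)

for name, t in t_stats.items():
    print(f"{name:10s} t = {t:7.2f}")
print("minutes of sleep lost per extra hour of work:", round(minutes_lost_per_hour, 2))
print("individually significant at about the 5% level:", significant)
```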

The standard error on Aeduc is especially large relative to the estimate. This is the phenomenon 
described earlier for the wage equation. In the sample of 239 people, 183 (76.6%) have no change in 
education over the six-year period; 90% of the people have a change in education of at most one year. 
As reflected by the extremely large standard error of $\hat{\beta}_2$, there is not nearly enough variation
in education to estimate $\beta_2$ with any precision. Anyway, $\hat{\beta}_2$ is practically very small.


Panel data can also be used to estimate finite distributed lag models. Even if we specify the equa- 
tion for only two years, we need to collect more years of data to obtain the lagged explanatory vari- 
ables. The following is a simple example. 


EXAMPLE 13.6 Distributed Lag of Crime Rate on Clear-Up Rate


Eide (1994) uses panel data from police districts in Norway to estimate a distributed lag model for
crime rates. The single explanatory variable is the “clear-up percentage” (clrprc), the percentage
of crimes that led to a conviction. The crime rate data are from the years 1972 and 1978. Following
Eide, we lag clrprc for one and two years: it is likely that past clear-up rates have a deterrent effect on
current crime. This leads to the following unobserved effects model for the two years:


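$$\log(crime_{it}) = \beta_0 + \delta_0 d78_t + \beta_1 clrprc_{i,t-1} + \beta_2 clrprc_{i,t-2} + a_i + u_{it}.$$

When we difference the equation and estimate it using the data in CRIME3, we get

$$\widehat{\Delta\log(crime)} = \underset{(.064)}{.086} - \underset{(.0047)}{.0040}\,\Delta clrprc_{-1} - \underset{(.0052)}{.0132}\,\Delta clrprc_{-2} \tag{13.29}$$
$$n = 53,\quad R^2 = .193,\quad \bar{R}^2 = .161.$$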


The second lag is negative and statistically significant, which implies that a higher clear-up percent- 
age two years ago would deter crime this year. In particular, a 10 percentage point increase in clrprc 
two years ago would lead to an estimated 13.2% drop in the crime rate this year. This suggests that 
using more resources for solving crimes and obtaining convictions can reduce crime in the future. 


13-3a Organizing Panel Data 


In using panel data in an econometric study, it is important to know how the data should be stored. We 
must be careful to arrange the data so that the different time periods for the same cross-sectional unit 
(person, firm, city, and so on) are easily linked. For concreteness, suppose that the data set is on cities 
for two different years. For most purposes, the best way to enter the data is to have two records for 
each city, one for each year: the first record for each city corresponds to the early year, and the second 
record is for the later year. These two records should be adjacent. Therefore, a data set for 100 cities 
and two years will contain 200 records. The first two records are for the first city in the sample, the 
next two records are for the second city, and so on. (See Table 1.5 in Chapter 1 for an example.) This
makes it easy to construct the differences (storing these in the second record for each city) and to do a
pooled cross-sectional analysis, which can be compared with the differencing estimation.

Most of the two-period panel data sets accompanying this text are stored in this way (for exam- 
ple, CRIME2, CRIME3, GPA3, LOWBRTH, and RENTAL). We use a direct extension of this scheme 
for panel data sets with more than two time periods. 

A second way of organizing two periods of panel data is to have only one record per cross- 
sectional unit. This requires two entries for each variable, one for each time period. The panel data in 
SLP75_81 are organized in this way. Each individual has data on the variables slpnap75, slpnap81, 
totwrk75, totwrk81, and so on. Creating the differences from 1975 to 1981 is easy. Other panel data
sets with this structure are TRAFFIC1 and VOTE2. Putting the data in one record, however, does not
allow a pooled OLS analysis using the two time periods on the original data. Also, this organizational 
method does not work for panel data sets with more than two time periods, a case we will consider in 
Section 13-5. 
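
The two layouts are easy to move between. The following sketch uses two made-up records in the
one-record-per-unit layout (the numbers are hypothetical, not taken from SLP75_81, and the function
names are chosen here for illustration). It builds the differences directly and also reshapes the data
into two adjacent records per person, the layout needed for a pooled analysis.

```python
# Hypothetical data in the one-record-per-unit layout (values are made up).
wide = [
    {"id": 1, "slpnap75": 3400, "slpnap81": 3300, "totwrk75": 2000, "totwrk81": 2300},
    {"id": 2, "slpnap75": 3100, "slpnap81": 3250, "totwrk75": 2600, "totwrk81": 2400},
]


def differences(rows, variables, first, last):
    """One record per unit holding the change in each variable from `first` to `last`."""
    return [
        {"id": r["id"], **{f"d_{v}": r[f"{v}{last}"] - r[f"{v}{first}"] for v in variables}}
        for r in rows
    ]


def wide_to_long(rows, variables, years):
    """Two adjacent records per unit, earlier year first."""
    long_rows = []
    for r in rows:
        for year in years:
            rec = {"id": r["id"], "year": year}
            for v in variables:
                rec[v] = r[f"{v}{year}"]
            long_rows.append(rec)
    return long_rows


variables = ["slpnap", "totwrk"]
diffs = differences(wide, variables, first=75, last=81)
long_rows = wide_to_long(wide, variables, years=[75, 81])

print(diffs)
for rec in long_rows:
    print(rec)
```

With the long layout in hand, both the pooled OLS regression on levels and the differenced regression
can be run from the same file.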


13-4 Policy Analysis with Two-Period Panel Data 


Panel data sets are very useful for policy analysis and, in particular, program evaluation. In the sim- 
plest program evaluation setup, a sample of individuals, firms, cities, and so on, is obtained in the first 
time period. Some of these units, those in the treatment group, then take part in a particular program 
in a later time period; the ones that do not are the control group. This is similar to the natural experi- 
ment literature discussed earlier, with one important difference: the same cross-sectional units appear 
in each time period. 

As an example, suppose we wish to evaluate the effect of a Michigan job training program on 
worker productivity of manufacturing firms (see also Computer Exercise C3 in Chapter 9). Let $scrap_{it}$
denote the scrap rate of firm i during year t (the number of items, per 100, that must be scrapped due
to defects). Let $grant_{it}$ be a binary indicator equal to one if firm i in year t received a job
training grant. For the years 1987 and 1988, the model is


$$scrap_{it} = \beta_0 + \delta_0 y88_t + \beta_1 grant_{it} + a_i + u_{it}, \quad t = 1, 2, \tag{13.30}$$


where $y88_t$ is a dummy variable for 1988 and $a_i$ is the unobserved firm effect or the firm fixed
effect. The unobserved effect contains such factors as average employee ability, capital, and managerial
skill; these are roughly constant over a two-year period. We are concerned about $a_i$ being
systematically related to whether a firm receives a grant. For example, administrators of the program might
give priority to firms whose workers have lower skills. Or, the opposite problem could occur: to make 
the job training program appear effective, administrators may give the grants to employers with more 
productive workers. Actually, in this particular program, grants were awarded on a first-come, first- 
served basis. But whether a firm applied early for a grant could be correlated with worker productiv- 
ity. In that case, an analysis using a single cross section or just a pooling of the cross sections will 
produce biased and inconsistent estimators. 
Differencing to remove $a_i$ gives


$$\Delta scrap_i = \delta_0 + \beta_1 \Delta grant_i + \Delta u_i. \tag{13.31}$$


Therefore, we simply regress the change in the scrap rate on the change in the grant indicator. Because
no firms received grants in 1987, $grant_{i1} = 0$ for all i, and so
$\Delta grant_i = grant_{i2} - grant_{i1} = grant_{i2}$, which simply indicates whether the firm received
a grant in 1988. However, it is generally important to difference all variables (dummy variables
included) because this is necessary for removing $a_i$ in the unobserved effects model (13.30).

Estimating the first-differenced equation using the data in JTRAIN gives 


$$\widehat{\Delta scrap} = \underset{(.405)}{-.564} - \underset{(.683)}{.739}\,\Delta grant$$
$$n = 54,\quad R^2 = .022.$$

Therefore, we estimate that having a job training grant lowered the scrap rate on average by .739.
But the estimate is not statistically different from zero.
We get stronger results by using log(scrap) and estimating the percentage effect: 


$$\widehat{\Delta\log(scrap)} = \underset{(.097)}{-.057} - \underset{(.164)}{.317}\,\Delta grant$$
$$n = 54,\quad R^2 = .067.$$


Having a job training grant is estimated to lower the scrap rate by about 27.2%. [We obtain this
estimate from equation (7.10): $\exp(-.317) - 1 \approx -.272$.] The t statistic is about $-1.93$, which
is marginally significant. By contrast, using pooled OLS of log(scrap) on y88 and grant gives
$\hat{\beta}_1 = .057$ (standard error = .431). Thus, we find no significant relationship between the
scrap rate and the job training grant. Because this differs so much from the first-difference estimates,
it suggests that firms that have lower-ability workers are more likely to receive a grant.
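
The two numbers quoted for the log specification can be checked in a few lines. This sketch uses only
the reported coefficient and standard error on the grant variable; the function name is chosen here for
illustration.

```python
import math


def exact_percent_change(coef):
    """Exact percentage change in y implied by a dummy-variable coefficient in a log(y) equation."""
    return 100.0 * (math.exp(coef) - 1.0)


coef_grant = -0.317   # coefficient on the change in grant, log(scrap) equation
se_grant = 0.164      # its reported standard error

pct_effect = exact_percent_change(coef_grant)
t_stat = coef_grant / se_grant

print("estimated percentage effect:", round(pct_effect, 1))
print("t statistic:", round(t_stat, 2))
```

The raw coefficient would suggest a 31.7% reduction; the exact calculation is preferred here because the
approximation is less accurate for changes of this size.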

It is useful to study the program evaluation model more generally. Let $y_{it}$ denote an outcome
variable and let $prog_{it}$ be a program participation dummy variable. The simplest unobserved effects
model is


$$y_{it} = \beta_0 + \delta_0 d2_t + \beta_1 prog_{it} + a_i + u_{it}. \tag{13.32}$$


If program participation only occurred in the second period, then the OLS estimator of $\beta_1$ in the
differenced equation has a very simple representation:


$$\hat{\beta}_1 = \overline{\Delta y}_{treat} - \overline{\Delta y}_{control}. \tag{13.33}$$

That is, we compute the average change in y over the two time periods for the treatment and control
groups. Then, $\hat{\beta}_1$ is the difference of these. This is the panel data version of the
difference-in-differences estimator in equation (13.11) for two pooled cross sections. With panel data,
we have a potentially important advantage: we can difference y across time for the same cross-sectional
units. This allows us to control for person-, firm-, or city-specific effects, as the model in (13.32)
makes clear.
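
The representation in (13.33) is an algebraic fact about regressing on a single dummy variable, and it
can be verified numerically. The sketch below uses six made-up changes in y (three treated units, three
controls); the numbers and names are hypothetical and serve only to show that the OLS slope and the
difference in average changes coincide.

```python
def mean(values):
    return sum(values) / len(values)


def ols_slope(x, y):
    """Slope from a simple regression of y on x (intercept included)."""
    xbar, ybar = mean(x), mean(y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return sxy / sxx


# Hypothetical changes in y and participation in period 2 only,
# so the change in prog equals prog in the second period.
delta_y = [1.0, 2.0, 0.5, -1.0, 0.0, 3.0]
delta_prog = [1, 1, 1, 0, 0, 0]

slope = ols_slope(delta_prog, delta_y)

treated = [dy for dy, d in zip(delta_y, delta_prog) if d == 1]
control = [dy for dy, d in zip(delta_y, delta_prog) if d == 0]
diff_in_means = mean(treated) - mean(control)

print("OLS slope on the change in prog:", slope)
print("difference in average changes:  ", diff_in_means)
```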

If program participation takes place in both periods, $\hat{\beta}_1$ cannot be written as in (13.33), but we
interpret it in the same way: it is the change in the average value of y due to program participation. 

Controlling for time-varying factors does not change anything of significance. We simply difference
those variables and include them along with $\Delta prog$. This allows us to control for time-varying
variables that might be correlated with program designation.

The same differencing method works for analyzing the effects of any policy that varies across 
city or state. The following is a simple example. 


EXAMPLE 13.7 Effect of Drunk Driving Laws on Traffic Fatalities


Many states in the United States have adopted different policies in an attempt to curb drunk driv- 
ing. Two types of laws that we will study here are open container laws—which make it illegal for 
passengers to have open containers of alcoholic beverages, and administrative per se laws—which 
allow courts to suspend licenses after a driver is arrested for drunk driving but before the driver is 
convicted. One possible analysis is to use a single cross section of states to regress driving fatalities 
(or those related to drunk driving) on dummy variable indicators for whether each law is present. This 
is unlikely to work well because states decide, through legislative processes, whether they need such 
laws. Therefore, the presence of laws is likely to be related to the average drunk driving fatalities in 
recent years. A more convincing analysis uses panel data over a time period during which some states 
adopted new laws (and some states may have repealed existing laws). The file TRAFFIC1 contains 
data for 1985 and 1990 for all 50 states and the District of Columbia. The dependent variable is the 
number of traffic deaths per 100 million miles driven (dthrte). In 1985, 19 states had open container 
laws, while 22 states had such laws in 1990. In 1985, 21 states had per se laws; the number had grown 
to 29 by 1990. 
Using OLS after first differencing gives 


$$\widehat{\Delta dthrte} = \underset{(.052)}{-.497} - \underset{(.206)}{.420}\,\Delta open - \underset{(.117)}{.151}\,\Delta admn \tag{13.34}$$
$$n = 51,\quad R^2 = .119.$$


The estimates suggest that adopting an open container law lowered the traffic fatality rate by .42, a
nontrivial effect given that the average death rate in 1985 was 2.7 with a standard deviation of about
.6. The estimate is statistically significant at the 5% level against a two-sided alternative. The
administrative per se law has a smaller effect, and its t statistic is only −1.29; but the estimate has
the sign we expect. The intercept in this equation shows that traffic fatalities fell substantially for
all states over the five-year period, whether or not there were any law changes. The states that adopted
an open container law over this period saw a further drop, on average, in fatality rates.

Other laws might also affect traffic fatalities, such as seat belt laws, motorcycle helmet laws, and
maximum speed limits. In addition, we might want to control for age and gender distributions, as well
as measures of how influential an organization such as Mothers Against Drunk Driving is in each
state.


GOING FURTHER 13.4
In Example 13.7, $\Delta admn = -1$ for the state of Washington. Explain what this means.


13-5 Differencing with More Than Two Time Periods


We can also use differencing with more than two time periods. For illustration, suppose we have N 
individuals and T = 3 time periods for each individual. A general unobserved effects model is 


$$y_{it} = \delta_1 + \delta_2 d2_t + \delta_3 d3_t + \beta_1 x_{it1} + \cdots + \beta_k x_{itk} + a_i + u_{it} \tag{13.35}$$


for t = 1, 2, and 3. (The total number of observations is therefore 3N.) Notice that we now include 
two time-period dummies in addition to the intercept. It is a good idea to allow a separate intercept 
for each time period, especially when we have a small number of them. The base period, as always, 
is t = 1. The intercept for the second time period is $\delta_1 + \delta_2$, and so on. We are primarily
interested in $\beta_1, \beta_2, \ldots, \beta_k$. If the unobserved effect $a_i$ is correlated with any
of the explanatory variables, then using pooled OLS on the three years of data results in biased and
inconsistent estimates.

The key assumption is that the idiosyncratic errors are uncorrelated with the explanatory variable 
in each time period: 


$$\mathrm{Cov}(x_{itj}, u_{is}) = 0, \quad \text{for all } t, s, \text{ and } j. \tag{13.36}$$


That is, the explanatory variables are strictly exogenous after we take out the unobserved effect, $a_i$.
(The strict exogeneity assumption stated in terms of a zero conditional expectation is given in the
chapter appendix.) Assumption (13.36) rules out cases in which future explanatory variables react to
current changes in the idiosyncratic errors, as must be the case if $x_{itj}$ is a lagged dependent variable. If
we have omitted an important time-varying variable, then (13.36) is generally violated. Measurement 
error in one or more explanatory variables can cause (13.36) to be false, just as in Chapter 9. In 
Chapters 15 and 16, we will discuss what can be done in such cases. 

If $a_i$ is correlated with $x_{itj}$, then $x_{itj}$ will be correlated with the composite error,
$v_{it} = a_i + u_{it}$, under (13.36). We can eliminate $a_i$ by differencing adjacent periods. In the
T = 3 case, we subtract
time period one from time period two and time period two from time period three. This gives 


$$\Delta y_{it} = \delta_2 \Delta d2_t + \delta_3 \Delta d3_t + \beta_1 \Delta x_{it1} + \cdots + \beta_k \Delta x_{itk} + \Delta u_{it} \tag{13.37}$$


for t = 2 and 3. We do not have a differenced equation for t = 1 because there is nothing to subtract 
from the t = 1 equation. Now, (13.37) represents two time periods for each individual in the sample.
If this equation satisfies the classical linear model assumptions, then pooled OLS gives unbiased
estimators, and the usual t and F statistics are valid for hypothesis testing. We can also appeal to
asymptotic results. The important requirement for OLS to be consistent is that $\Delta u_{it}$ is
uncorrelated with $\Delta x_{itj}$ for all j and t = 2 and 3. This is the natural extension from the two
time period case.

Notice how (13.37) contains the differences in the year dummies, $d2_t$ and $d3_t$. For t = 2,
$\Delta d2_t = 1$ and $\Delta d3_t = 0$; for t = 3, $\Delta d2_t = -1$ and $\Delta d3_t = 1$. The
intercept in (13.37) has been differenced away. This is inconvenient for the purpose of computing an
R-squared, but there are two simple remedies. First, some regression packages allow you to compute the
total sum of squares (SST) as if there is a constant, and this provides a better goodness-of-fit measure
for explaining $\Delta y_{it}$. Second, one can estimate a simple transformation of the equation that
includes an intercept:


$$\Delta y_{it} = \alpha_0 + \alpha_3 d3_t + \beta_1 \Delta x_{it1} + \cdots + \beta_k \Delta x_{itk} + \Delta u_{it}, \quad t = 2, 3.$$


The estimates of the $\beta_j$ will not change, and now the R-squared is properly computed. One reason for
estimating the equation in (13.37) is that we can compare the estimates directly with using pooled 
OLS on the levels and with methods that we cover in Chapter 14. 
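
To see why the slope estimates are unaffected, it helps to compare the time effects implied by the two
forms. As a brief sketch, write the intercept form with parameters $\alpha_0$ and $\alpha_3$ on a
constant and on $d3_t$. Using $\Delta d2_t = 1$, $\Delta d3_t = 0$ for t = 2 and $\Delta d2_t = -1$,
$\Delta d3_t = 1$ for t = 3, the time effects in (13.37) and in the intercept form are

$$
\begin{aligned}
t = 2:&\quad \delta_2 \cdot 1 + \delta_3 \cdot 0 = \delta_2 = \alpha_0, \\
t = 3:&\quad \delta_2 \cdot (-1) + \delta_3 \cdot 1 = \delta_3 - \delta_2 = \alpha_0 + \alpha_3,
\end{aligned}
$$

so that $\alpha_0 = \delta_2$ and $\alpha_3 = \delta_3 - 2\delta_2$. Both versions allow a separate,
unrestricted time effect in each differenced period; they are the same model written with different
parameters, which is why the estimates of the $\beta_j$ coincide.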

With more than three time periods, things are similar. If we have the same T time periods for each 
of N cross-sectional units, we say that the data set is a balanced panel: we have the same time periods 
for all individuals, firms, cities, and so on. When T is small relative to N, we should include a dummy 
variable for each time period to account for secular changes that are not being modeled. Therefore, 
after first differencing, the equation looks like 


$$\Delta y_{it} = \delta_2 \Delta d2_t + \delta_3 \Delta d3_t + \cdots + \delta_T \Delta dT_t + \beta_1 \Delta x_{it1} + \cdots + \beta_k \Delta x_{itk} + \Delta u_{it}, \quad t = 2, 3, \ldots, T, \tag{13.38}$$

where we have T − 1 time periods on each unit i for the first-differenced equation. The total number
of observations is N(T − 1). To obtain a useful R-squared measure, we can instead estimate an intercept
and include the time dummies $d3_t, d4_t, \ldots, dT_t$ in place of the differences in the dummies.

It is simple to estimate (13.38) by pooled OLS, provided the observations have been properly
organized and the differencing carefully done. To facilitate first differencing, the data file should
consist of NT records. The first T records are for the first cross-sectional observation, arranged
chronologically; the second T records are for the second cross-sectional observation, arranged
chronologically; and so on. Then, we compute the differences, with the change from t − 1 to t stored in
the time t record. Therefore, the differences for t = 1 should be missing values for all N
cross-sectional observations. Without doing this, you run the risk of using bogus observations in the
regression analysis. An invalid observation is created when the last observation for, say, person i − 1
is subtracted from the first observation for person i. If you do the regression on the differenced data,
and NT or NT − 1 observations are reported, then you forgot to set the t = 1 observations as missing.
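
The bookkeeping described above can be sketched in a few lines. The records below are made up (N = 2
units, T = 3 periods), and the function names are chosen here for illustration. The careful version
leaves the t = 1 difference missing for each unit; the naive version differences straight down the file
and manufactures a bogus observation at the boundary between units.

```python
# Hypothetical stacked panel: records sorted by unit, then chronologically.
records = [
    {"unit": 1, "t": 1, "y": 10.0},
    {"unit": 1, "t": 2, "y": 12.0},
    {"unit": 1, "t": 3, "y": 11.0},
    {"unit": 2, "t": 1, "y": 50.0},
    {"unit": 2, "t": 2, "y": 47.0},
    {"unit": 2, "t": 3, "y": 52.0},
]


def first_difference(rows, var):
    """Change from t - 1 to t within each unit; None for each unit's first period."""
    out = []
    prev_unit, prev_val = None, None
    for row in rows:
        if row["unit"] == prev_unit:
            out.append(row[var] - prev_val)
        else:
            out.append(None)  # no earlier period for this unit
        prev_unit, prev_val = row["unit"], row[var]
    return out


def naive_difference(rows, var):
    """Differences straight down the file, ignoring unit boundaries (incorrect)."""
    return [None] + [rows[k][var] - rows[k - 1][var] for k in range(1, len(rows))]


dy = first_difference(records, "y")
dy_naive = naive_difference(records, "y")

n_units, n_periods = 2, 3
usable = sum(d is not None for d in dy)
usable_naive = sum(d is not None for d in dy_naive)

print("careful:", dy, "usable =", usable)             # N(T - 1) observations
print("naive:  ", dy_naive, "usable =", usable_naive)  # NT - 1 observations, one is bogus
```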

When using more than two time periods, we must assume that $\Delta u_{it}$ is uncorrelated over time for
the usual standard errors and test statistics to be valid. This assumption is sometimes reasonable, but
it does not follow if we assume that the original idiosyncratic errors, $u_{it}$, are uncorrelated over
time (an assumption we will use in Chapter 14). In fact, if we assume the $u_{it}$ are serially
uncorrelated with constant variance, then the correlation between $\Delta u_{it}$ and $\Delta u_{i,t+1}$
can be shown to be −.5. If $u_{it}$ follows a stable AR(1) model, then $\Delta u_{it}$ will be serially
correlated. Only when $u_{it}$ follows a random walk will $\Delta u_{it}$ be serially uncorrelated.
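
The value −.5 can be checked directly. As a brief sketch, suppose the $u_{it}$ are serially uncorrelated
with common variance, denoted here by $\sigma_u^2$ (a symbol introduced only for this derivation). Then

$$
\begin{aligned}
\mathrm{Var}(\Delta u_{it}) &= \mathrm{Var}(u_{it}) + \mathrm{Var}(u_{i,t-1}) = 2\sigma_u^2, \\
\mathrm{Cov}(\Delta u_{it}, \Delta u_{i,t+1}) &= \mathrm{Cov}(u_{it} - u_{i,t-1},\; u_{i,t+1} - u_{it}) = -\mathrm{Var}(u_{it}) = -\sigma_u^2, \\
\mathrm{Corr}(\Delta u_{it}, \Delta u_{i,t+1}) &= \frac{-\sigma_u^2}{2\sigma_u^2} = -.5.
\end{aligned}
$$

The only term the two adjacent differences share is $u_{it}$, which enters them with opposite signs.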

It is easy to test for serial correlation in the first-differenced equation. Let $r_{it} = \Delta u_{it}$
denote the first difference of the original error. If $r_{it}$ follows the AR(1) model
$r_{it} = \rho r_{i,t-1} + e_{it}$, then we can easily test $H_0\colon \rho = 0$. First, we estimate
(13.38) by pooled OLS and obtain the residuals, $\hat{r}_{it}$. Then, we run a simple pooled OLS
regression of $\hat{r}_{it}$ on $\hat{r}_{i,t-1}$, $t = 3, \ldots, T$, $i = 1, \ldots, N$, and compute a
standard t test for the coefficient on $\hat{r}_{i,t-1}$. (Or we can make the t statistic robust to
heteroskedasticity.) The coefficient $\hat{\rho}$ on $\hat{r}_{i,t-1}$ is a consistent estimator of
$\rho$. Because we are using the lagged residual, we lose another time period. For example, if we started
with T = 3, the differenced equation has two time periods, and the test for serial correlation is just a
cross-sectional regression of the residuals from the third time period on the residuals from the second
time period. We will give an example later.
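
The mechanics of the test can be sketched with artificial data. In the simulation below, the differenced
errors themselves stand in for the residuals $\hat{r}_{it}$ (an assumption made to keep the example
short; in practice the residuals come from estimating (13.38)). The level errors are generated as
serially uncorrelated, so the estimated coefficient should be near −.5, and the test should reject
$H_0\colon \rho = 0$.

```python
import math
import random


def simple_ols(x, y):
    """Return (slope, usual t statistic) from regressing y on x with an intercept."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    intercept = ybar - slope * xbar
    ssr = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se = math.sqrt(ssr / (n - 2) / sxx)
    return slope, slope / se


rng = random.Random(1)
N, T = 2000, 4

lagged, current = [], []
for _ in range(N):
    u = [rng.gauss(0, 1) for _ in range(T)]       # serially uncorrelated level errors
    r = [u[t] - u[t - 1] for t in range(1, T)]    # r_it for t = 2, ..., T
    for s in range(1, len(r)):                    # pairs (r_it, r_i,t-1) start at t = 3
        lagged.append(r[s - 1])
        current.append(r[s])

rho_hat, t_stat = simple_ols(lagged, current)
print("observations used:", len(current))         # N(T - 2)
print("rho_hat:", round(rho_hat, 3), " t statistic:", round(t_stat, 1))
```

Lags are formed only within a unit, never across units, so each unit contributes T − 2 observations to
the test regression.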

If we detect serial correlation (and even if we do not bother to test for serial correlation), it is
possible to adjust the standard errors to allow unrestricted forms of serial correlation and
heteroskedasticity. Such methods, which fall under the topic of cluster-robust standard errors, are
described in nontechnical terms in the appendix to this chapter, and a formal treatment is in Wooldridge
(2010, Chapter 10). The standard approach assumes the observations are independent across i but, with a
moderately large N and not “too large” T, it allows any kind of serial correlation pattern (along with
heteroskedasticity).
Less common these days, but still useful, is to correct for the presence of AR(1) serial correlation in
$r_{it}$ by using feasible GLS. Essentially, within each cross-sectional observation, we would use the
Prais-Winsten transformation based on $\hat{\rho}$ from the serial correlation test described above. (We
clearly prefer Prais-Winsten to Cochrane-Orcutt here, as dropping the first time period would now mean
losing N cross-sectional observations.) Unfortunately, commands that perform AR(1) corrections for time
series regressions generally will not work when applied to panel data. Standard Prais-Winsten methods
will treat the observations as if they followed an AR(1) process across i and t; this makes no sense, as
we are assuming the observations are independent across i. Wooldridge (2010, Chapter 10) discusses how
one can use GLS methods in first-differenced equations, and some software packages have special commands
that perform the estimation.

If there is no serial correlation in the errors, the usual methods for dealing with heteroskedasticity
are valid. We can use the usual heteroskedasticity-robust standard errors as the simplest fix. Also, we
can use the Breusch-Pagan and White tests for heteroskedasticity from Chapter 8.


GOING FURTHER 13.5
Does serial correlation in $\Delta u_{it}$ cause the first-differenced estimator to be biased and
inconsistent? Why is serial correlation a concern?


Differencing more than two years of panel data is very useful for policy analysis, as shown by the
following example.


EXAMPLE 13.8 Effect of Enterprise Zones on Unemployment Claims 


Papke (1994) studied the effect of the Indiana enterprise zone (EZ) program on unemployment claims. 
She analyzed 22 cities in Indiana over the period from 1980 to 1988. Six enterprise zones were desig- 
nated in 1984, and four more were assigned in 1985. Twelve of the cities in the sample did not receive 
an enterprise zone over this period; they served as the control group. 

A simple policy evaluation model is 


$$\log(uclms_{it}) = \theta_t + \beta_1 ez_{it} + a_i + u_{it},$$


where $uclms_{it}$ is the number of unemployment claims filed during year t in city i. The parameter
$\theta_t$ just denotes a different intercept for each time period. Generally, unemployment claims were
falling statewide over this period, and this should be reflected in the different year intercepts. The
binary variable $ez_{it}$ is equal to one if city i at time t was an enterprise zone; we are interested
in $\beta_1$. The unobserved effect $a_i$ represents fixed factors that affect the economic climate in
city i. Because enterprise zone designation was not determined randomly (enterprise zones are usually
economically depressed areas), it is likely that $ez_{it}$ and $a_i$ are positively correlated (high
$a_i$ means higher unemployment claims, which lead to a higher chance of being given an EZ). Thus, we
should difference the equation to eliminate $a_i$:


$$\Delta\log(uclms_{it}) = \delta_1 \Delta d81_t + \delta_2 \Delta d82_t + \cdots + \delta_8 \Delta d88_t + \beta_1 \Delta ez_{it} + \Delta u_{it} \qquad [13.39]$$


The dependent variable in this equation, the change in $\log(uclms_{it})$, is the approximate annual growth 
rate in unemployment claims from year t − 1 to t. We can estimate this equation for the years 1981 
to 1988 using the data in EZUNEM; the total sample size is 22 · 8 = 176. The estimate of $\beta_1$ is 
$\hat{\beta}_1 = -.182$ (standard error = .078). Therefore, it appears that the presence of an EZ causes about a 
16.6% [exp(−.182) − 1 ≈ −.166] fall in unemployment claims. This is an economically large and 
statistically significant effect. 
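
The percentage calculation can be checked directly. The following is a minimal sketch that uses only the two numbers reported in the example (the coefficient and its usual OLS standard error); the variable names are illustrative and nothing here touches the EZUNEM data.

```python
import math

beta_hat = -0.182   # estimated coefficient on the change in ez, from the text
se_usual = 0.078    # usual OLS standard error, from the text

# Exact proportionate change implied by a coefficient on a dummy
# variable when the dependent variable is in logs
pct_exact = 100 * (math.exp(beta_hat) - 1)

# Rough approximation that reads the coefficient directly as a percent
pct_approx = 100 * beta_hat

# Usual t statistic for the null hypothesis of no effect
t_usual = beta_hat / se_usual

print(f"exact: {pct_exact:.1f}%, approximate: {pct_approx:.1f}%, t = {t_usual:.2f}")
```

The gap between the exact and approximate figures grows with the size of the coefficient, which is why the exact transformation is used here.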

If we add the lagged OLS residuals to the differenced equation (and lose the year 1981), we get 
$\hat{\rho} = -.197$ (t = −2.44), so there is evidence of some negative serial correlation. When we compute 
a standard error on the ez dummy variable that is robust to both serial correlation and heteroskedasticity, 
as described in the appendix, it is .092, which is above the usual OLS standard error reported above. The 
cluster-robust t statistic is about −1.98, and so the estimated enterprise zone effect is less statistically significant. 
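
Negative serial correlation in differenced errors is not unusual. If the original errors $u_{it}$ are serially uncorrelated with constant variance, then $\Delta u_{it}$ and $\Delta u_{i,t-1}$ share the term $u_{i,t-1}$ with opposite signs, so their correlation is −.5. The sketch below illustrates this with simulated errors only; the panel dimensions are hypothetical, and it regresses the true differenced errors on their lag rather than OLS residuals, which is a simplification of the test described above.

```python
import random

random.seed(0)
n_units, n_periods = 2000, 9   # hypothetical panel dimensions

num = 0.0  # running sum of du_t * du_{t-1}
den = 0.0  # running sum of du_{t-1} squared
for _ in range(n_units):
    # Serially uncorrelated errors in levels for one unit
    u = [random.gauss(0.0, 1.0) for _ in range(n_periods)]
    # First differences lose the first period
    du = [u[t] - u[t - 1] for t in range(1, n_periods)]
    for t in range(1, len(du)):
        num += du[t] * du[t - 1]
        den += du[t - 1] ** 2

# Slope from regressing du_t on du_{t-1} (no intercept; the errors have mean zero)
rho_hat = num / den
print(round(rho_hat, 3))
```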


EXAMPLE 13.9 County Crime Rates in North Carolina 


Cornwell and Trumbull (1994) used data on 90 counties in North Carolina, for the years 1981 through 
1987, to estimate an unobserved effects model of crime; the data are contained in CRIME4. Here, we 
estimate a simpler version of their model, and we difference the equation over time to eliminate $a_i$, 
the unobserved effect. (Cornwell and Trumbull use a different transformation, which we will cover 
in Chapter 14.) Various factors including geographical location, attitudes toward crime, historical 
records, and reporting conventions might be contained in $a_i$. The crime rate is the number of crimes per 
person, prbarr is the estimated probability of arrest, prbconv is the estimated probability of convic- 
tion (given an arrest), prbpris is the probability of serving time in prison (given a conviction), avgsen 
is the average sentence length served, and polpc is the number of police officers per capita. As is 
standard in criminometric studies, we use the logs of all variables to estimate elasticities. We also 
include a full set of year dummies to control for state trends in crime rates. We can use the years 1982 
through 1987 to estimate the differenced equation. The quantities in parentheses are the usual OLS 
standard errors; the quantities in brackets are standard errors robust to both serial correlation and 
heteroskedasticity: 
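$$
\begin{aligned}
\widehat{\Delta\log(crmrte)} ={}& \underset{\substack{(.017)\\ [.014]}}{.008} - \underset{\substack{(.024)\\ [.022]}}{.100}\,d83 - \underset{\substack{(.024)\\ [.020]}}{.048}\,d84 - \underset{\substack{(.023)\\ [.025]}}{.005}\,d85 \\
&+ \underset{\substack{(.024)\\ [.021]}}{.028}\,d86 + \underset{\substack{(.024)\\ [.024]}}{.041}\,d87 - \underset{\substack{(.030)\\ [.056]}}{.327}\,\Delta\log(prbarr) \\
&- \underset{\substack{(.018)\\ [.040]}}{.238}\,\Delta\log(prbconv) - \underset{\substack{(.026)\\ [.046]}}{.165}\,\Delta\log(prbpris) \\
&- \underset{\substack{(.022)\\ [.026]}}{.022}\,\Delta\log(avgsen) + \underset{\substack{(.027)\\ [.103]}}{.398}\,\Delta\log(polpc) \qquad [13.40] \\
& n = 540,\quad R^2 = .433,\quad \bar{R}^2 = .422.
\end{aligned}
$$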


The three probability variables—of arrest, conviction, and serving prison time—all have the expected 
sign, and all are statistically significant. For example, a 1% increase in the probability of arrest is pre- 
dicted to lower the crime rate by about .33%. The average sentence variable shows a modest deterrent 
effect, but it is not statistically significant. 

The coefficient on the police per capita variable is somewhat surprising and is a feature of most 
studies that seek to explain crime rates. Interpreted causally, it says that a 1% increase in police per 
capita increases crime rates by about .4%. (The usual t statistic is very large, almost 15.) It is hard to 
believe that having more police officers causes more crime. What is going on here? There are at least 
two possibilities. First, the crime rate variable is calculated from reported crimes. It might be that, when 
there are additional police, more crimes are reported. Second, the police variable might be endogenous 
in the equation for other reasons: counties may enlarge the police force when they expect crime rates to 
increase. In this case, (13.40) cannot be interpreted in a causal fashion. In Chapters 15 and 16, we will 
cover models and estimation methods that can account for this additional form of endogeneity. 

The special case of the White test for heteroskedasticity in Section 8-3 gives F = 75.48 and 
p-value = .0000, so there is strong evidence of heteroskedasticity. (Technically, this test is not valid if 
there is also serial correlation, but it is strongly suggestive.) Testing for AR(1) serial correlation yields 
$\hat{\rho} = -.233$, t = −4.77, so negative serial correlation exists. The standard errors in brackets adjust 
for serial correlation and heteroskedasticity. [See the discussion in the appendix.] No variables lose 
statistical significance, but the t statistics on the significant deterrent variables get notably smaller. For 
example, the t statistic on the probability of conviction variable goes from −13.22 using the usual OLS 
standard error to —6.10 using the fully robust standard error. Equivalently, the confidence intervals 
constructed using the robust standard errors will, appropriately, be much wider than those based on the 
usual OLS standard errors. 
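
To make the bracketed standard errors less of a black box, here is a rough sketch of how a first-difference regression and a cluster-robust (by unit) variance estimate can be computed with NumPy. Everything in it is an assumption made for illustration: the data are simulated rather than taken from CRIME4, there is a single regressor, and no finite-sample correction is applied, whereas software packages commonly apply one.

```python
import numpy as np

rng = np.random.default_rng(0)
n_units, n_periods = 90, 7   # same dimensions as the example, but simulated data

# Simulated panel: y_it = 1.0 * x_it + a_i + u_it, with a_i correlated with x_it
a = rng.normal(size=(n_units, 1))
x = a + rng.normal(size=(n_units, n_periods))
u = rng.normal(size=(n_units, n_periods))
y = 1.0 * x + a + u

# First differences remove a_i; one period is lost per unit
dy = np.diff(y, axis=1).ravel()
dx = np.diff(x, axis=1).ravel()
unit = np.repeat(np.arange(n_units), n_periods - 1)   # cluster id for each row

# Pooled OLS on the differenced data (intercept plus slope)
X = np.column_stack([np.ones_like(dx), dx])
XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ dy
resid = dy - X @ beta

# Usual OLS variance: sigma^2 * (X'X)^{-1}
n, k = X.shape
V_usual = (resid @ resid) / (n - k) * XtX_inv

# Cluster-robust sandwich: the middle term sums X_i' e_i e_i' X_i over units,
# which allows arbitrary serial correlation and heteroskedasticity within a unit
meat = np.zeros((k, k))
for i in range(n_units):
    rows = unit == i
    s = X[rows].T @ resid[rows]
    meat += np.outer(s, s)
V_cluster = XtX_inv @ meat @ XtX_inv

se_usual = np.sqrt(np.diag(V_usual))
se_cluster = np.sqrt(np.diag(V_cluster))
print("slope:", beta[1], "usual se:", se_usual[1], "cluster se:", se_cluster[1])
```

The point estimates are identical under both approaches; only the estimated sampling variance, and hence the t statistics and confidence intervals, changes.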


Naturally, we can apply the Chow test to panel data models estimated by first differencing. As in 
the case of pooled cross sections, we rarely want to test whether the intercepts are constant over time; 
for many reasons, we expect the intercepts to be different. Much more interesting is to test whether 
slope coefficients have changed over time, and we can easily carry out such tests by interacting the 
explanatory variables of interest with time-period dummy variables. Interestingly, while we cannot 
estimate the slopes on variables that do not change over time, we can test whether the partial effects of 
time-constant variables have changed over time. As an illustration, suppose we observe three years of 
data on a random sample of people working in 2000, 2002, and 2004, and specify the model (for the 
log of wage, lwage), 


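$$
\begin{aligned}
lwage_{it} ={}& \beta_0 + \delta_1 d02_t + \delta_2 d04_t + \beta_1 female_i + \gamma_1 d02_t\, female_i \\
&+ \gamma_2 d04_t\, female_i + \mathbf{z}_{it}\boldsymbol{\lambda} + a_i + u_{it},
\end{aligned}
$$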


where $\mathbf{z}_{it}\boldsymbol{\lambda}$ is shorthand for other explanatory variables included in the model and their coefficients. 
When we first difference, we eliminate the intercept for 2000, $\beta_0$, and also the gender wage gap 
for 2000, $\beta_1$. However, the change in $d02_t\, female_i$ is $(\Delta d02_t) female_i$, which does not drop out. 
Consequently, we can estimate how the wage gap has changed in 2002 and 2004 relative to 2000, and 
we can test whether $\gamma_1 = 0$, or $\gamma_2 = 0$, or both. We might also ask whether the union wage premium 
has changed over time, in which case we include in the model $union_{it}$, $d02_t\, union_{it}$, and $d04_t\, union_{it}$. 
The coefficients on all of these explanatory variables can be estimated because $union_{it}$ would 
presumably have some time variation. 
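
Writing out the first difference of the wage equation with the gender interactions makes the point concrete. This is a sketch derived only from the model stated above, with $\Delta$ denoting the change between adjacent survey years:

$$
\Delta lwage_{it} = \delta_1 \Delta d02_t + \delta_2 \Delta d04_t + \gamma_1 (\Delta d02_t) female_i + \gamma_2 (\Delta d04_t) female_i + \Delta\mathbf{z}_{it}\boldsymbol{\lambda} + \Delta u_{it}.
$$

The terms $\beta_0$, $\beta_1 female_i$, and $a_i$ are constant over time and drop out, while $\gamma_1$ and $\gamma_2$ remain because $\Delta d02_t$ and $\Delta d04_t$ are not zero in every period.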

If one tries to estimate a model containing interactions by differencing by hand, it can be a bit 
tricky. For example, in the previous equation with union status, we must simply difference the 
interaction terms, $d02_t\, union_{it}$ and $d04_t\, union_{it}$. We cannot compute the proper differences as, say, $d02_t \Delta union_{it}$ 
and $d04_t \Delta union_{it}$, or by even replacing $d02_t$ and $d04_t$ with their first differences. 
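
A tiny numerical illustration of this warning, using made-up union status for a single hypothetical worker observed in 2000, 2002, and 2004. It contrasts the correct approach (build the interaction in levels, then difference) with one of the incorrect shortcuts (multiply the year dummy by the differenced union variable).

```python
years = [2000, 2002, 2004]
union = {2000: 1, 2002: 1, 2004: 0}   # hypothetical union status for one worker

# Year dummy for 2002
d02 = {y: int(y == 2002) for y in years}

# Correct: form the interaction in levels, then difference it
inter = {y: d02[y] * union[y] for y in years}
correct = [inter[years[t]] - inter[years[t - 1]] for t in range(1, len(years))]

# Incorrect: multiply the level of the dummy by the differenced union variable
d_union = [union[years[t]] - union[years[t - 1]] for t in range(1, len(years))]
wrong = [d02[years[t]] * d_union[t - 1] for t in range(1, len(years))]

print("difference of interaction:", correct)   # [1, -1]
print("d02 times change in union:", wrong)     # [0, 0]
```

For this worker the properly differenced interaction varies across the two differenced periods, while the shortcut is identically zero, so the two regressors would lead to different estimates.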

As a general comment, it is important to return to the original model and remember that the 
differencing is used to eliminate $a_i$. It is easiest to use a built-in command that allows first differencing 
as an option in panel data analysis, as discussed in the appendix to this chapter. (We will see some of 
the other options in Chapter 14.) 


13-5a Potential Pitfalls in First Differencing Panel Data 


In this and previous sections, we have argued that differencing panel data over time, in order to elimi- 
nate a time-constant unobserved effect, is a valuable method for obtaining causal effects. Nevertheless, 
differencing is not free of difficulties. We have already discussed potential problems with the method 
when the key explanatory variables do not vary much over time (and the method is useless for 
explanatory variables that never vary over time). Unfortunately, even when we do have sufficient time 
variation in the $x_{itj}$, first-differenced (FD) estimation can be subject to serious biases. We have already 
mentioned that strict exogeneity of the regressors is a critical assumption. Unfortunately, as discussed 
in Wooldridge (2010, Section 11-1), having more time periods generally does not reduce the 
inconsistency in the FD estimator when the regressors are not strictly exogenous (say, if $y_{i,t-1}$ is included 
among the $x_{itj}$). 

Another important drawback to the FD estimator is that it can be worse than pooled OLS if one 
or more of the explanatory variables is subject to measurement error, especially the classical errors- 
in-variables model discussed in Section 9-3. Differencing a poorly measured regressor reduces its 
variation relative to its correlation with the differenced error caused by classical measurement error, 
resulting in a potentially sizable bias. Solving such problems can be very difficult. See Section 15-8 
and Wooldridge (2010, Chapter 11). 
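
The measurement error point can be illustrated with a small simulation. The sketch below is built on assumptions chosen purely for illustration: two time periods, a true slope of one, a regressor that changes little over time, classical measurement error with standard deviation .5, and no unobserved effect at all, so that any difference between the two estimators comes from measurement error alone.

```python
import random

random.seed(1)
n = 20000
beta = 1.0   # true slope (hypothetical)

def slope(x, y):
    """OLS slope from a simple regression of y on x with an intercept."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

# True regressor in two periods; it changes only a little over time
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [v + 0.5 * random.gauss(0, 1) for v in x1]
y1 = [beta * v + random.gauss(0, 1) for v in x1]
y2 = [beta * v + random.gauss(0, 1) for v in x2]

# Observed regressor = true value + classical measurement error
m1 = [v + 0.5 * random.gauss(0, 1) for v in x1]
m2 = [v + 0.5 * random.gauss(0, 1) for v in x2]

# Pooled OLS in levels versus OLS on the first differences
pooled = slope(m1 + m2, y1 + y2)
fd = slope([b - a for a, b in zip(m1, m2)], [b - a for a, b in zip(y1, y2)])
print(f"pooled OLS: {pooled:.2f}, first difference: {fd:.2f}, truth: {beta}")
```

Under these assumed variances the probability limits are roughly .82 for pooled OLS and .33 for FD: differencing leaves little true variation in the regressor but doubles the measurement error variance, so the attenuation is much more severe.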


Summary 


We have studied methods for analyzing independently pooled cross-sectional and panel data sets. Inde- 
pendent cross sections arise when different random samples are obtained in different time periods (usually 
years). OLS using pooled data is the leading method of estimation, and the usual inference procedures 
are available, including corrections for heteroskedasticity. (Serial correlation is not an issue because the 
samples are independent across time.) Because of the time series dimension, we often allow different 
time intercepts. We might also interact time dummies with certain key variables to see how they have 
changed over time. This is especially important in the policy evaluation literature for natural experiments. 
The difference-in-differences methodology, and its extensions, has proven very useful for studying policy 
interventions. 

Panel data sets are being used more and more in applied work, especially for policy analysis. These 
are data sets in which the same cross-sectional units are followed over time. Panel data sets are most useful 
when controlling for time-constant unobserved features—of people, firms, cities, and so on—which we think 
might be correlated with the explanatory variables in our model. One way to remove the unobserved effect is 
to difference the data in adjacent time periods. Then, a standard OLS analysis on the differences can be used. 
Using two periods of data results in a cross-sectional regression of the differenced data. The usual inference 
procedures are asymptotically valid under homoskedasticity; exact inference is available under normality. 

For more than two time periods, we can use pooled OLS on the differenced data; we lose the first time 
period because of the differencing. In addition to homoskedasticity, we must assume that the differenced 
errors are serially uncorrelated in order to apply the usual t and F statistics. (The chapter appendix contains 
a careful listing of the assumptions.) Naturally, any variable that is constant over time drops out of the anal- 
ysis. The appendix contains a discussion of how one computes standard errors that allow for unrestricted 
forms of serial correlation and heteroskedasticity. 


Key Terms 


Average Treatment Effect 
Balanced Panel 
Clustering 
Cluster-Robust Standard Errors 
Composite Error 
Difference-in-Differences (DD or DID) Estimator 
Difference-in-Difference-in-Differences (DDD) Estimator 
First-Differenced Equation 
First-Differenced Estimator 
Fixed Effect 
Fixed Effects Model 
Group-Specific 
Heterogeneity Bias 
Idiosyncratic Error 
Independently Pooled Cross Section 
Longitudinal Data 
Natural Experiment 
Panel Data 
Parallel Trends Assumption 
Quasi-Experiment 
Strict Exogeneity 
Unobserved Effect 
Unobserved Effects Model 
Unobserved Heterogeneity 
Year Dummy Variables 


Problems 


1 In Example 13.1, assume that the averages of all factors other than educ have remained constant over 
time and that the average level of education is 12.2 for the 1972 sample and 13.3 in the 1984 sample. 
Using the estimates in Table 13.1, find the estimated change in average fertility between 1972 and 
1984. (Be sure to account for the intercept change and the change in average education.) 


2 Using the data in KIELMC, the following equations were estimated using the years 1978 and 1981: 


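$$
\begin{aligned}
\widehat{\log(price)} ={}& \underset{(.26)}{11.49} - \underset{(.058)}{.547}\,nearinc + \underset{(.080)}{.394}\,y81 \cdot nearinc \\
& n = 321,\quad R^2 = .220
\end{aligned}
$$

and 

$$
\begin{aligned}
\widehat{\log(price)} ={}& \underset{(.27)}{11.18} + \underset{(.044)}{.563}\,y81 - \underset{(.067)}{.403}\,y81 \cdot nearinc \\
& n = 321,\quad R^2 = .337.
\end{aligned}
$$

Compare the estimates on the interaction term y81·nearinc with those from equation (13.9). Why are 
the estimates so different? 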


3 Why can we not use first differences when we have independent cross sections in two years (as 
opposed to panel data)? 


4 If we think that $\beta_1$ is positive in (13.14) and that $\Delta u_i$ and $\Delta unem_i$ are negatively correlated, what is the 
bias in the OLS estimator of $\beta_1$ in the first-differenced equation? [Hint: Review equation (5.4).] 


5 Suppose that we want to estimate the effect of several variables on annual saving and that we have a 
panel data set on individuals collected on January 31, 1990, and January 31, 1992. If we include a year 
dummy for 1992 and use first differencing, can we also include age in the original model? Explain. 


6 In 1985, neither Florida nor Georgia had laws banning open alcohol containers in vehicle passenger 
compartments. By 1990, Florida had passed such a law, but Georgia had not. 

(i) Suppose you can collect random samples of the driving-age population in both states, for 1985 
and 1990. Let arrest be a binary variable equal to unity if a person was arrested for drunk driv- 
ing during the year. Without controlling for any other factors, write down a linear probability 
model that allows you to test whether the open container law reduced the probability of being 
arrested for drunk driving. Which coefficient in your model measures the effect of the law? 

(ii) Why might you want to control for other factors in the model? What might some of these factors be? 

(iii) Now, suppose that you can only collect data for 1985 and 1990 at the county level for the two 
states. The dependent variable would be the fraction of licensed drivers arrested for drunk driv- 
ing during the year. How does this data structure differ from the individual-level data described 
in part (i)? What econometric method would you use? 


7 (i) Using the data in INJURY for Kentucky, we find the estimated equation when afchnge is dropped 
from (13.13) is 

$$
\begin{aligned}
\widehat{\log(durat)} ={}& \underset{(0.022)}{1.129} + \underset{(.042)}{.253}\,highearn + \underset{(.052)}{.198}\,afchnge \cdot highearn \\
& n = 5{,}626;\quad R^2 = .021.
\end{aligned}
$$

Is it surprising that the estimate on the interaction is fairly close to that in (13.13)? Explain. 

(ii) When afchnge is included but highearn is dropped, the result is 

$$
\begin{aligned}
\widehat{\log(durat)} ={}& \underset{(0.023)}{1.233} - \underset{(.040)}{.100}\,afchnge + \underset{(.050)}{.447}\,afchnge \cdot highearn \\
& n = 5{,}626;\quad R^2 = .016.
\end{aligned}
$$

Why is the coefficient on the interaction term now so much larger than in (13.13)? [Hint: In equation 
(13.10), what is the assumption being made about the treatment and control groups if $\beta_1 = 0$?] 


Computer Exercises 


C1 Use the data in FERTIL1 for this exercise. 

(i) In the equation estimated in Example 13.1, test whether living environment at age 16 has an 
effect on fertility. (The base group is large city.) Report the value of the F statistic and the 
p-value. 

(ii) Test whether region of the country at age 16 (South is the base group) has an effect on fertility. 

(iii) Let u be the error term in the population equation. Suppose you think that the variance of u 
changes over time (but not with educ, age, and so on). A model that captures this is 


$$u^2 = \gamma_0 + \gamma_1 y74 + \gamma_2 y76 + \cdots + \gamma_6 y84 + v.$$


Using this model, test for heteroskedasticity in u. (Hint: Your F test should have 6 and 1,122 
degrees of freedom.) 

(iv) Add the interaction terms y74·educ, y76·educ, . . . , y84·educ to the model estimated in 
Table 13.1. Explain what these terms represent. Are they jointly significant? 


C2 Use the data in CPS78_85 for this exercise. 

(i) How do you interpret the coefficient on y85 in equation (13.2)? Does it have an interesting 
interpretation? (Be careful here; you must account for the interaction terms y85·educ and 
y85·female.) 

(ii) Holding other factors fixed, what is the estimated percent increase in nominal wage for a 
male with 12 years of education? Propose a regression to obtain a confidence interval for this 
estimate. [Hint: To get the confidence interval, replace y85·educ with y85·(educ − 12); refer to 
Example 6.3.] 

(iii) Reestimate equation (13.2) but let all wages be measured in 1978 dollars. In particular, define the 
real wage as rwage = wage for 1978 and as rwage = wage/1.65 for 1985. Now, use log(rwage) 
in place of log(wage) in estimating (13.2). Which coefficients differ from those in equation (13.2)? 

(iv) Explain why the R-squared from your regression in part (iii) is not the same as in equation 
(13.2). (Hint: The residuals, and therefore the sum of squared residuals, from the two 
regressions are identical.) 

(v) Describe how union participation changed from 1978 to 1985. 

(vi) Starting with equation (13.2), test whether the union wage differential changed over time. (This 
should be a simple t test.) 

(vii) Do your findings in parts (v) and (vi) conflict? Explain. 


C3 Use the data in KIELMC for this exercise. 

(i) The variable dist is the distance from each home to the incinerator site, in feet. Consider the model 

$$\log(price) = \beta_0 + \delta_0 y81 + \beta_1 \log(dist) + \delta_1 y81 \cdot \log(dist) + u.$$

If building the incinerator reduces the value of homes closer to the site, what is the sign of $\delta_1$? 
What does it mean if $\beta_1 > 0$? 

(ii) Estimate the model from part (i) and report the results in the usual form. Interpret the coefficient 
on y81·log(dist). What do you conclude? 

(iii) Add age, age², rooms, baths, log(intst), log(land), and log(area) to the equation. Now, what do 
you conclude about the effect of the incinerator on housing values? 

(iv) Why is the coefficient on log(dist) positive and statistically significant in part (ii) but not in 
part (iii)? What does this say about the controls used in part (iii)? 


C4 Use the data in INJURY for this exercise. 

(i) Using the data for Kentucky, reestimate equation (13.13), adding as explanatory variables male, 
married, and a full set of industry and injury type dummy variables. How does the estimate 
on afchnge·highearn change when these other factors are controlled for? Is the estimate still 
statistically significant? 

(ii) What do you make of the small R-squared from part (i)? Does this mean the equation is useless? 

(iii) Estimate equation (13.13) using the data for Michigan. Compare the estimates on the interaction 
term for Michigan and Kentucky. Is the Michigan estimate statistically significant? What do you 
make of this? 


C5 Use the data in RENTAL for this exercise. The data for the years 1980 and 1990 include rental prices 
and other variables for college towns. The idea is to see whether a stronger presence of students affects 
rental rates. The unobserved effects model is 

$$\log(rent_{it}) = \beta_0 + \delta_0 y90_t + \beta_1 \log(pop_{it}) + \beta_2 \log(avginc_{it}) + \beta_3 pctstu_{it} + a_i + u_{it},$$

where pop is city population, avginc is average income, and pctstu is student population as a percentage 
of city population (during the school year). 

(i) Estimate the equation by pooled OLS and report the results in standard form. What do you 
make of the estimate on the 1990 dummy variable? What do you get for $\hat{\beta}_{pctstu}$? 

(ii) Are the standard errors you report in part (i) valid? Explain. 
(iii) Now, difference the equation and estimate by OLS. Compare your estimate of $\beta_{pctstu}$ with that 
from part (ii). Does the relative size of the student population appear to affect rental prices? 

(iv) Obtain the heteroskedasticity-robust standard errors for the first-differenced equation in 
part (iii). Does this change your conclusions? 


C6 Use CRIME3 for this exercise. 

(i) In the model of Example 13.6, test the hypothesis $H_0: \beta_1 = \beta_2$. (Hint: Define $\theta_1 = \beta_1 - \beta_2$ and 
write $\beta_1$ in terms of $\theta_1$ and $\beta_2$. Substitute this into the equation and then rearrange. Do a t test on $\theta_1$.) 

(ii) If $\beta_1 = \beta_2$, show that the differenced equation can be written as 

$$\Delta\log(crime_i) = \delta_0 + \delta_1 \Delta avgclr_i + \Delta u_i,$$

where $\delta_1 = 2\beta_1$ and $avgclr_i = (clrprc_{i,-1} + clrprc_{i,-2})/2$ is the average clear-up percentage 
over the previous two years. 

(iii) Estimate the equation from part (ii). Compare the adjusted R-squared with that in (13.22). 
Which model would you finally use? 


C7 Use GPA3 for this exercise. The data set is for 366 student-athletes from a large university for fall and 
spring semesters. [A similar analysis is in Maloney and McCormick (1993), but here we use a true 
panel data set.] Because you have two terms of data for each student, an unobserved effects model is 
appropriate. The primary question of interest is this: Do athletes perform more poorly in school during 
the semester their sport is in season? 

(i) Use pooled OLS to estimate a model with term GPA (trmgpa) as the dependent variable. The 
explanatory variables are spring, sat, hsperc, female, black, white, frstsem, tothrs, crsgpa, and 
season. Interpret the coefficient on season. Is it statistically significant? 

(ii) Most of the athletes who play their sport only in the fall are football players. Suppose the 
ability levels of football players differ systematically from those of other athletes. If ability is 
not adequately captured by SAT score and high school percentile, explain why the pooled OLS 
estimators will be biased. 

(iii) Now, use the data differenced across the two terms. Which variables drop out? Now, test for an 
in-season effect. 

(iv) Can you think of one or more potentially important, time-varying variables that have been 
omitted from the analysis? 


C8 VOTE2 includes panel data on House of Representatives elections in 1988 and 1990. Only winners 
from 1988 who are also running in 1990 appear in the sample; these are the incumbents. An 
unobserved effects model explaining the share of the incumbent’s vote in terms of expenditures by both 
candidates is 

$$vote_{it} = \beta_0 + \delta_0 d90_t + \beta_1 \log(inexp_{it}) + \beta_2 \log(chexp_{it}) + \beta_3 incshr_{it} + a_i + u_{it},$$

where $incshr_{it}$ is the incumbent’s share of total campaign spending (in percentage form). The 
unobserved effect $a_i$ contains characteristics of the incumbent, such as “quality,” as well as things about 
the district that are constant. The incumbent’s gender and party are constant over time, so these are 
subsumed in $a_i$. We are interested in the effect of campaign expenditures on election outcomes. 

(i) Difference the given equation across the two years and estimate the differenced equation 
by OLS. Which variables are individually significant at the 5% level against a two-sided 
alternative? 

(ii) In the equation from part (i), test for joint significance of Δlog(inexp) and Δlog(chexp). Report 
the p-value. 

(iii) Reestimate the equation from part (i) using Δincshr as the only independent variable. Interpret 
the coefficient on Δincshr. For example, if the incumbent’s share of spending increases by 
10 percentage points, how is this predicted to affect the incumbent’s share of the vote? 

(iv) Redo part (iii), but now use only the pairs that have repeat challengers. [This allows us to 
control for characteristics of the challengers as well, which would be in $a_i$. Levitt (1994) 
conducts a much more extensive analysis.] 


C9 Use CRIME4 for this exercise. 


(i) Add the logs of each wage variable in the data set and estimate the model by first differencing. 
How does including these variables affect the coefficients on the criminal justice variables in 
Example 13.9? 

(ii) Do the wage variables in (i) all have the expected sign? Are they jointly significant? Explain. 


C10 For this exercise, we use JTRAIN to determine the effect of the job training grant on hours of job 
training per employee. The basic model for the three years is 

$$
\begin{aligned}
hrsemp_{it} ={}& \beta_0 + \delta_1 d88_t + \delta_2 d89_t + \beta_1 grant_{it} \\
&+ \beta_2 grant_{i,t-1} + \beta_3 \log(employ_{it}) + a_i + u_{it}.
\end{aligned}
$$

(i) Estimate the equation using first differencing. How many firms are used in the estimation? 
How many total observations would be used if each firm had data on all variables (in particular, 
hrsemp) for all three time periods? 

(ii) Interpret the coefficient on grant and comment on its significance. 

(iii) Is it surprising that $grant_{-1}$ is insignificant? Explain. 

(iv) Do larger firms train their employees more or less, on average? How big are the differences in 
training? 


C11 The file MATHPNL contains panel data on school districts in Michigan for the years 1992 through 
1998. It is the district-level analogue of the school-level data used by Papke (2005). The response 
variable of interest in this question is math4, the percentage of fourth graders in a district receiving a 
passing score on a standardized math test. The key explanatory variable is rexpp, which is real expen- 
ditures per pupil in the district. The amounts are in 1997 dollars. The spending variable will appear in 
logarithmic form. 

(i) Consider the static unobserved effects model 


$$
\begin{aligned}
math4_{it} ={}& \delta_1 y93_t + \cdots + \delta_6 y98_t + \beta_1 \log(rexpp_{it}) \\
&+ \beta_2 \log(enrol_{it}) + \beta_3 lunch_{it} + a_i + u_{it},
\end{aligned}
$$

where $enrol_{it}$ is total district enrollment and $lunch_{it}$ is the percentage of students in the district 
eligible for the school lunch program. (So $lunch_{it}$ is a pretty good measure of the district-wide 
poverty rate.) Argue that $\beta_1/10$ is the percentage point change in $math4_{it}$ when real per-student 
spending increases by roughly 10%. 

(ii) Use first differencing to estimate the model in part (i). The simplest approach is to allow an 
intercept in the first-differenced equation and to include dummy variables for the years 1994 
through 1998. Interpret the coefficient on the spending variable. 

(iii) Now, add one lag of the spending variable to the model and reestimate using first differencing. 
Note that you lose another year of data, so you are only using changes starting in 1994. Discuss 
the coefficients and significance on the current and lagged spending variables. 

(iv) Obtain heteroskedasticity-robust standard errors for the first-differenced regression in part (iii). 
How do these standard errors compare with those from part (iii) for the spending variables? 

(v) Now, obtain standard errors robust to both heteroskedasticity and serial correlation. What does 
this do to the significance of the lagged spending variable? 

(vi) Verify that the differenced errors $r_{it} = \Delta u_{it}$ have negative serial correlation by carrying out a test 
of AR(1) serial correlation. 

(vii) Based on a fully robust joint test, does it appear necessary to include the enrollment and lunch 
variables in the model? 


C12 Use the data in MURDER for this exercise. 

(i) Using the years 1990 and 1993, estimate the equation 

$$mrdrte_{it} = \delta_0 + \delta_1 d93_t + \beta_1 exec_{it} + \beta_2 unem_{it} + a_i + u_{it}, \quad t = 1, 2$$

by pooled OLS and report the results in the usual form. Do not worry that the usual OLS 
standard errors are inappropriate because of the presence of $a_i$. Do you estimate a deterrent 
effect of capital punishment? 

(ii) Compute the FD estimates (use only the differences from 1990 to 1993; you should have 
51 observations in the FD regression). Now what do you conclude about a deterrent effect? 

(iii) In the FD regression from part (ii), obtain the residuals, say, $\hat{e}_i$. Run the Breusch-Pagan 
regression $\hat{e}_i^2$ on $\Delta exec_i$, $\Delta unem_i$ and compute the F test for heteroskedasticity. Do the same for 
the special case of the White test [that is, regress $\hat{e}_i^2$ on $\hat{y}_i$, $\hat{y}_i^2$, where the fitted values are from 
part (ii)]. What do you conclude about heteroskedasticity in the FD equation? 

(iv) Run the same regression from part (ii), but obtain the heteroskedasticity-robust t statistics. What 
happens? 

(v) Which t statistic on $\Delta exec_i$ do you feel more comfortable relying on, the usual one or the 
heteroskedasticity-robust one? Why? 


C13 Use the data in WAGEPAN for this exercise. 

(i) Consider the unobserved effects model 

$$
\begin{aligned}
lwage_{it} ={}& \beta_0 + \delta_1 d81_t + \cdots + \delta_7 d87_t + \beta_1 educ_i \\
&+ \gamma_1 d81_t\, educ_i + \cdots + \gamma_7 d87_t\, educ_i + \beta_2 union_{it} + a_i + u_{it},
\end{aligned}
$$

where $a_i$ is allowed to be correlated with $educ_i$ and $union_{it}$. Which parameters can you estimate 
using first differencing? 

(ii) Estimate the equation from part (i) by FD, and test the null hypothesis that the return to 
education has not changed over time. 

(iii) Test the hypothesis from part (ii) using a fully robust test, that is, one that allows arbitrary 
heteroskedasticity and serial correlation in the FD errors, $\Delta u_{it}$. Does your conclusion change? 

(iv) Now allow the union differential to change over time (along with education) and estimate the 
equation by FD. What is the estimated union differential in 1980? What about 1987? Is the 
difference statistically significant? 

(v) Test the null hypothesis that the union differential has not changed over time, and discuss your 
results in light of your answer to part (iv). 


C14 Use the data in JTRAIN3 for this exercise. 

(i) Estimate the simple regression model $re78 = \beta_0 + \beta_1 train + u$, and report the results in the 
usual form. Based on this regression, does it appear that job training, which took place in 1976 
and 1977, had a positive effect on real labor earnings in 1978? 

(ii) Now use the change in real labor earnings, cre = re78 − re75, as the dependent variable. (We 
need not difference train because we assume there was no job training prior to 1975. That is, if 
we define ctrain = train78 − train75 then ctrain = train78 because train75 = 0.) Now what 
is the estimated effect of training? Discuss how it compares with the estimate in part (i). 

(iii) Find the 95% confidence interval for the training effect using the usual OLS standard error and 
the heteroskedasticity-robust standard error, and describe your findings. 


C15 The data set HAPPINESS contains independently pooled cross sections for the even years from 1994 
through 2006, obtained from the General Social Survey. The dependent variable for this problem is a 
measure of “happiness,” vhappy, which is a binary variable equal to one if the person reports being 
“very happy” (as opposed to just “pretty happy” or “not too happy”). 

(i) Which year has the largest number of observations? Which has the smallest? What is the 
percentage of people in the sample reporting they are “very happy”? 

(ii) Regress vhappy on all of the year dummies, leaving out y94 so that 1994 is the base year. 
Compute a heteroskedasticity-robust statistic of the null hypothesis that the proportion of very 
happy people has not changed over time. What is the p-value of the test? 

(iii) To the regression in part (ii), add the dummy variables occattend and regattend. Interpret their 
coefficients. (Remember, the coefficients are interpreted relative to a base group.) How would 
you summarize the effects of church attendance on happiness? 

(iv) Define a variable, say highinc, equal to one if family income is above $25,000. (Unfortunately, 
the same threshold is used in each year, and so inflation is not accounted for. Also, $25,000 
is hardly what one would consider “high income.”) Include highinc, unem10, educ, and teens 
in the regression in part (iii). Is the coefficient on regattend affected much? What about its 
statistical significance? 

(v) Discuss the signs, magnitudes, and statistical significance of the four new variables in part (iv). 
Do the estimates make sense? 

(vi) Controlling for the factors in part (iv), do there appear to be differences in happiness by gender 
or race? Justify your answer. 


C16 Use the data in COUNTYMURDERS for this exercise. The data set covers murders and executions 
(capital punishment) for 2,197 counties in the United States. 


(i) Find the average value of murdrate across all counties and years. What is the standard
deviation? For what percentage of the sample is murdrate equal to zero?

(ii) How many observations have execs equal to zero? What is the maximum value of execs? Why is
the average of execs so small?

(iii) Consider the model

$$murdrate_{it} = \theta_t + \beta_1 execs_{it} + \beta_2 execs_{i,t-1} + \beta_3 percblack_{it} + \beta_4 percmale_{it} + \beta_5 perc1019 + \beta_6 perc2029 + a_i + u_{it},$$

where $\theta_t$ represents a different intercept for each time period, $a_i$ is the county fixed effect, and
$u_{it}$ is the idiosyncratic error. What do we need to assume about $a_i$ and the execution variables in
order for pooled OLS to consistently estimate the parameters, in particular, $\beta_1$ and $\beta_2$?

(iv) Apply OLS to the equation from part (iii) and report the estimates of $\beta_1$ and $\beta_2$, along with the
usual pooled OLS standard errors. Do you estimate that executions have a deterrent effect on
murders? What do you think is happening?

(v) Even if the pooled OLS estimators are consistent, do you trust the standard errors obtained from
part (iv)? Explain.

(vi) Now estimate the equation in part (iii) using first differencing to remove $a_i$. What are the new
estimates of $\beta_1$ and $\beta_2$? Are they very different from the estimates from part (iv)?

(vii) Using the estimates from part (vi), can you say there is evidence of a statistically significant
deterrent effect of capital punishment on the murder rate? If possible, in addition to the
usual OLS standard errors, use those that are robust to any kind of serial correlation or
heteroskedasticity in the FD errors.
APPENDIX 13A


13A.1 Assumptions for Pooled OLS Using First Differences 


In this appendix, we provide careful statements of the assumptions for the first-differencing estima- 
tor. Verification of the following claims can be found in Wooldridge (2010, Chapter 10). 


Assumption FD.1 


For each i, the model is 


$$y_{it} = \beta_1 x_{it1} + \cdots + \beta_k x_{itk} + a_i + u_{it}, \quad t = 1, \ldots, T,$$


where the $\beta_j$ are the parameters to estimate and $a_i$ is the unobserved effect.


Assumption FD.2 


We have a random sample from the cross section. 


Assumption FD.3 


Each explanatory variable changes over time (for at least some i), and no perfect linear relationships 
exist among the explanatory variables. 


For the next assumption, it is useful to let $X_i$ denote the explanatory variables for all time periods for
cross-sectional observation $i$; thus, $X_i$ contains $x_{itj}$, $t = 1, \ldots, T$, $j = 1, \ldots, k$.


Assumption FD.4 


For each $t$, the expected value of the idiosyncratic error given the explanatory variables in all time
periods and the unobserved effect is zero: $E(u_{it} \mid X_i, a_i) = 0$.


When Assumption FD.4 holds, we sometimes say that the $x_{itj}$ are strictly exogenous conditional on
the unobserved effect. The idea is that, once we control for $a_i$, there is no correlation between the $x_{isj}$
and the remaining idiosyncratic error, $u_{it}$, for all $s$ and $t$.

As stated, Assumption FD.4 is stronger than necessary. We use this form of the assumption 
because it emphasizes that we are interested in the equation 


$$E(y_{it} \mid X_i, a_i) = E(y_{it} \mid x_{it}, a_i) = \beta_1 x_{it1} + \cdots + \beta_k x_{itk} + a_i,$$


so that the $\beta_j$ measure partial effects of the observed explanatory variables holding fixed, or “controlling
for,” the unobserved effect, $a_i$. Nevertheless, an important implication of FD.4, and one that is
sufficient for the unbiasedness of the FD estimator, is $E(\Delta u_{it} \mid X_i) = 0$, $t = 2, \ldots, T$. In fact, for consistency
we can simply assume that $\Delta x_{itj}$ is uncorrelated with $\Delta u_{it}$ for all $t = 2, \ldots, T$ and $j = 1, \ldots, k$.
See Wooldridge (2010, Chapter 10) for further discussion.
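The step from FD.4 to the condition on the differenced errors can be written out explicitly. The following is a short sketch that uses only the notation of FD.1 and FD.4 and the law of iterated expectations; it is meant as a reading aid, not as a substitute for the formal treatment cited above.

```latex
% Difference the model in FD.1 across adjacent periods: a_i drops out.
\Delta y_{it} = \beta_1 \Delta x_{it1} + \cdots + \beta_k \Delta x_{itk} + \Delta u_{it},
\qquad t = 2, \ldots, T.

% FD.4 holds for every period, so by iterated expectations
E(u_{it} \mid X_i) = E\big[ E(u_{it} \mid X_i, a_i) \mid X_i \big] = 0
\quad \text{for all } t,

% and therefore the differenced error has zero conditional mean:
E(\Delta u_{it} \mid X_i) = E(u_{it} \mid X_i) - E(u_{i,t-1} \mid X_i) = 0,
\qquad t = 2, \ldots, T.
```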

Under these first four assumptions, the first-difference estimators are unbiased. The key assumption
is FD.4, which is strict exogeneity of the explanatory variables. Under these same assumptions,
we can also show that the FD estimator is consistent with a fixed $T$ and as $N \to \infty$ (and perhaps more
generally).

The next two assumptions ensure that the standard errors and test statistics resulting from 
pooled OLS on the first differences are (asymptotically) valid. 


Assumption FD.5


The variance of the differenced errors, conditional on all explanatory variables, is constant:
$Var(\Delta u_{it} \mid X_i) = \sigma^2$, $t = 2, \ldots, T$.


Assumption FD.6 


For all $t \neq s$, the differences in the idiosyncratic errors are uncorrelated (conditional on all
explanatory variables): $Cov(\Delta u_{it}, \Delta u_{is} \mid X_i) = 0$, $t \neq s$.


Assumption FD.5 ensures that the differenced errors, $\Delta u_{it}$, are homoskedastic. Assumption FD.6
states that the differenced errors are serially uncorrelated, which means that the $u_{it}$ follow a random
walk across time (see Chapter 11). Under Assumptions FD.1 through FD.6, the FD estimator of the
$\beta_j$ is the best linear unbiased estimator (conditional on the explanatory variables).
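The link between FD.6 and a random walk can be seen in one line. In the sketch below, $r_{it}$ is a symbol introduced here for illustration (it does not appear in the text): it denotes the innovation of the random walk and is assumed to be serially uncorrelated conditional on $X_i$.

```latex
% Random walk in the idiosyncratic error (r_{it} is an illustrative symbol):
u_{it} = u_{i,t-1} + r_{it}
\quad \Longrightarrow \quad
\Delta u_{it} = u_{it} - u_{i,t-1} = r_{it}.

% The differenced errors inherit the properties of the innovations:
Cov(\Delta u_{it}, \Delta u_{is} \mid X_i) = Cov(r_{it}, r_{is} \mid X_i) = 0,
\qquad t \neq s.
```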


Assumption FD.7 


Conditional on $X_i$, the $\Delta u_{it}$ are independent and identically distributed normal random variables.


When we add Assumption FD.7, the FD estimators are normally distributed, and the t and F statistics
from pooled OLS on the differences have exact t and F distributions. Without FD.7, we can rely on
the usual asymptotic approximations.


13A.2 Computing Standard Errors Robust to Serial Correlation and 
Heteroskedasticity of Unknown Form 


Because the FD estimator is consistent as $N \to \infty$ under Assumptions FD.1 through FD.4, it would
be very handy to have a simple method of obtaining proper standard errors and test statistics that
allow for any kind of serial correlation or heteroskedasticity in the FD errors, $e_{it} = \Delta u_{it}$. Fortunately,
provided N is moderately large, and T is not “too large,” fully robust standard errors and test statistics 
are readily available. As mentioned in the text, a detailed treatment is above the level of this text. The 
technical arguments combine the insights described in Chapters 8 and 12, where statistics robust to 
heteroskedasticity and serial correlation are discussed. Actually, there is one important advantage 
with panel data: because we have a (large) cross section, we can allow unrestricted serial correlation 
in the errors $\{e_{it}\}$, provided T is not too large. We can contrast this situation with the Newey-West
approach in Section 12-5, where the estimated covariances must be downweighted as the observations
get farther apart in time. Wooldridge (2010, Chapter 10) provides further discussion.

The general approach to obtaining fully robust standard errors and test statistics in the con- 
text of panel data is known as clustering, and ideas have been borrowed from the cluster sampling 
literature. The idea is that each cross-sectional unit is defined as a cluster of observations over time, 
and arbitrary correlation—serial correlation—and changing variances are allowed within each 
cluster. Because of the relationship to cluster sampling, many econometric software packages have 
options for clustering standard errors and test statistics. The resulting standard errors are often called 
cluster-robust standard errors. As a bonus, such standard errors are also robust to heteroskedasticity 
of unknown form. 

Most commands look something like 


regress cy cd2 cd3 . . . cdT cx1 cx2 ... cxk, noconstant cluster(id), 


where “id” is a variable containing unique identifiers for the cross-sectional units and the “c” before 
each variable denotes “change.” The option “cluster(id)” at the end of the “regress” command 
tells the software to report all standard errors and test statistics (including t statistics and F-type
statistics) so that they are valid, in large cross sections, with any kind of serial correlation or
heteroskedasticity. The “noconstant” option suppresses the intercept, as it gets eliminated via the
differencing. An alternative is to allow a constant and to include the time dummies d3, d4, ..., dT in
levels form. This will not change the estimates on the explanatory variables of interest, just on the 
time effects. 

Some packages have an option that does not require differencing ahead of time, which saves 
some work, is likely to result in fewer mistakes, and also reminds us that the equation of interest is in 
levels, and differencing results in an estimating equation: 


regress D.(y d2 d3...dT x1 x2... xk), noconstant cluster(id) 


where “D.()” denotes differencing everything in parentheses. 

Reporting cluster-robust standard errors and test statistics is now very common in modern 
empirical work with panel data. Often the standard errors will be larger than either the usual OLS 
standard errors or those that correct only for heteroskedasticity, but it is possible for cluster-robust 
standard errors to be smaller, too. In any case, provided N is moderately large and T is not too large, 
the cluster-robust standard errors better reflect the uncertainty in the pooled OLS coefficients. 
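To make the idea of clustering concrete, the following Python sketch computes both the usual and the cluster-robust standard error for the simplest possible case: pooled OLS on first differences with a single regressor and no intercept. The function name `fd_cluster_robust` and the tiny data set are hypothetical, and the sketch omits the finite-sample adjustments that software packages typically apply, so its numbers will not match packaged output exactly.

```python
import math

def fd_cluster_robust(dx, dy):
    """Pooled OLS slope (no intercept) on differenced data with two standard errors.

    dx[i][t] and dy[i][t] hold the differenced regressor and outcome for unit i.
    Returns (beta_hat, usual_se, cluster_se).
    """
    sxx = sum(v * v for unit in dx for v in unit)
    sxy = sum(a * b for ux, uy in zip(dx, dy) for a, b in zip(ux, uy))
    beta_hat = sxy / sxx

    # Residuals, kept grouped by cross-sectional unit (the cluster)
    resid = [[b - beta_hat * a for a, b in zip(ux, uy)] for ux, uy in zip(dx, dy)]
    n_obs = sum(len(unit) for unit in dx)

    # Usual OLS variance: assumes homoskedastic, serially uncorrelated errors
    ssr = sum(e * e for unit in resid for e in unit)
    usual_se = math.sqrt(ssr / (n_obs - 1) / sxx)

    # Cluster-robust variance: sum the scores x*e within each unit first, then
    # square, so any within-unit correlation is carried into the variance
    meat = sum(sum(a * e for a, e in zip(ux, ue)) ** 2 for ux, ue in zip(dx, resid))
    cluster_se = math.sqrt(meat / sxx ** 2)
    return beta_hat, usual_se, cluster_se

# Hypothetical differenced data for two units and two differenced periods
dx = [[1.0, 1.0], [1.0, -1.0]]
dy = [[2.0, 2.0], [0.0, 0.0]]
beta_hat, usual_se, cluster_se = fd_cluster_robust(dx, dy)
```

With only two clusters this example is purely mechanical; the large-N justification described above does not apply to it. It does show that the cluster-robust standard error can differ from the usual one even when the point estimate is the same.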

There is one important point about clustering to account for serial correlation: it does not 
account for any cross-sectional correlation. In fact, we assume that the draws of units i from the 
population are independent. Removing one potential source of cross-sectional correlation, the unobserved
effect $a_i$, can help. Also, controlling for aggregate time effects through the time dummies
accounts for cross-sectional correlation caused by common shocks. 
In this chapter, we focus on two methods for estimating unobserved effects panel data models
that are at least as common as first differencing. Although these methods are somewhat harder to 
describe and implement, several econometrics packages support them. 

In Section 14-1, we discuss the fixed effects estimator, which, like first differencing, uses a
transformation to remove the unobserved effect $a_i$ prior to estimation. Any time-constant explanatory
variables are removed along with $a_i$.

The random effects estimator in Section 14-2 is attractive when we think the unobserved effect 
is uncorrelated with all the explanatory variables. If we have good controls in our equation, we might 
believe that any leftover neglected heterogeneity only induces serial correlation in the composite error 
term, but it does not cause correlation between the composite errors and the explanatory variables. 
Estimation of random effects models by generalized least squares is fairly easy and is routinely done 
by many econometrics packages. 

Section 14-3 introduces the relatively new correlated random effects approach, which provides 
a synthesis of fixed effects and random effects methods, and has been shown to be practically very 
useful. 

In Section 14-4, we show how panel data methods can be applied to other data structures, 
including matched pairs and cluster samples. 
14-1 Fixed Effects Estimation


First differencing is just one of the many ways to eliminate the fixed effect, $a_i$. An alternative method,
which works better under certain assumptions, is called the fixed effects transformation. To see
what this method involves, consider a model with a single explanatory variable: for each $i$,

$$y_{it} = \beta_1 x_{it} + a_i + u_{it}, \quad t = 1, 2, \ldots, T. \qquad [14.1]$$

Now, for each $i$, average this equation over time. We get

$$\bar{y}_i = \beta_1 \bar{x}_i + a_i + \bar{u}_i, \qquad [14.2]$$

where $\bar{y}_i = T^{-1} \sum_{t=1}^{T} y_{it}$, and so on. Because $a_i$ is fixed over time, it appears in both (14.1) and (14.2).
If we subtract (14.2) from (14.1) for each $t$, we wind up with

$$y_{it} - \bar{y}_i = \beta_1 (x_{it} - \bar{x}_i) + u_{it} - \bar{u}_i, \quad t = 1, 2, \ldots, T,$$

or

$$\ddot{y}_{it} = \beta_1 \ddot{x}_{it} + \ddot{u}_{it}, \quad t = 1, 2, \ldots, T, \qquad [14.3]$$

where $\ddot{y}_{it} = y_{it} - \bar{y}_i$ is the time-demeaned data on $y$, and similarly for $\ddot{x}_{it}$ and $\ddot{u}_{it}$. The fixed effects
transformation is also called the within transformation. The important thing about equation (14.3) is
that the unobserved effect, $a_i$, has disappeared. This suggests that we should estimate (14.3) by pooled
OLS. A pooled OLS estimator that is based on the time-demeaned variables is called the fixed effects
estimator or the within estimator. The latter name comes from the fact that OLS on (14.3) uses the
time variation in $y$ and $x$ within each cross-sectional observation.
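The within transformation in (14.3) can be illustrated with a few lines of Python. This is a minimal sketch for a balanced panel with one regressor; the function name `within_estimator` and the data are hypothetical, constructed without idiosyncratic noise so that the true slope is known to be 2 and the unobserved effect is positively related to $x$.

```python
from statistics import mean

def within_estimator(y, x):
    """Fixed effects (within) slope for a balanced panel with one regressor.

    y and x are lists of per-unit lists: y[i][t], x[i][t].
    """
    num = 0.0
    den = 0.0
    for yi, xi in zip(y, x):
        ybar, xbar = mean(yi), mean(xi)          # time averages for unit i
        for yit, xit in zip(yi, xi):
            num += (xit - xbar) * (yit - ybar)   # demeaned x times demeaned y
            den += (xit - xbar) ** 2
    return num / den

# Hypothetical noiseless data: y_it = 2 * x_it + a_i, with a_i larger for units
# that have larger x on average
x = [[1.0, 2.0, 4.0], [5.0, 7.0, 6.0], [10.0, 11.0, 13.0]]
a = [0.0, 10.0, 50.0]
y = [[2.0 * xit + ai for xit in xi] for xi, ai in zip(x, a)]

beta_fe = within_estimator(y, x)

# Pooled OLS slope (with an intercept) on the raw data, for comparison
all_x = [v for xi in x for v in xi]
all_y = [v for yi in y for v in yi]
mx, my = mean(all_x), mean(all_y)
beta_pooled = (sum((u - mx) * (v - my) for u, v in zip(all_x, all_y))
               / sum((u - mx) ** 2 for u in all_x))
```

In this constructed example the within estimator recovers the slope of 2 because time-demeaning removes $a_i$, while pooled OLS on the raw data overstates it because $a_i$ is positively correlated with $x_{it}$.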

The between estimator is obtained as the OLS estimator on the cross-sectional equation (14.2)
(where we include an intercept, $\beta_0$): we use the time averages for both $y$ and $x$ and then run a
cross-sectional regression. We will not study the between estimator in detail because it is biased when $a_i$ is
correlated with $\bar{x}_i$ (see Problem 2 at the end of this chapter). If we think $a_i$ is uncorrelated with $x_{it}$, it
is better to use the random effects estimator, which we cover in Section 14-2. The between estimator
ignores important information on how the variables change over time.

Adding more explanatory variables to the equation causes few changes. The original unobserved 
effects model is 


$$y_{it} = \beta_1 x_{it1} + \beta_2 x_{it2} + \cdots + \beta_k x_{itk} + a_i + u_{it}, \quad t = 1, 2, \ldots, T. \qquad [14.4]$$


We simply use the time-demeaning on each explanatory variable—including things like time-period 
dummies—and then do a pooled OLS regression using all time-demeaned variables. The general 
time-demeaned equation for each i is 


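$$\ddot{y}_{it} = \beta_1 \ddot{x}_{it1} + \beta_2 \ddot{x}_{it2} + \cdots + \beta_k \ddot{x}_{itk} + \ddot{u}_{it}, \quad t = 1, 2, \ldots, T, \qquad [14.5]$$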


which we estimate by pooled OLS. 

Under a strict exogeneity assumption on the explanatory variables, the fixed effects estimator is
unbiased: roughly, the idiosyncratic error $u_{it}$ should be uncorrelated with each explanatory variable across
all time periods. (See the chapter appendix for precise statements of the assumptions.) The fixed effects
estimator allows for arbitrary correlation between $a_i$ and the explanatory variables in any time period,
just as with first differencing. Because of this, any explanatory variable that is constant over
time for all $i$ gets swept away by the fixed effects transformation: $\ddot{x}_{it} = 0$ for all $i$ and $t$, if $x_{it}$ is
constant across $t$. Therefore, we cannot include variables such as place of birth or a city’s distance from a river.

GOING FURTHER 14.1
Suppose that in a family savings equation, for three years, we let $kids_{it}$ denote the number of children
in family $i$ for year $t$. If the number of kids is constant over this three-year period for most families
in the sample, what problems might this cause for estimating the effect that the number of kids has on savings?

The other assumptions needed for a straight OLS analysis to be valid are that the errors $u_{it}$ are
homoskedastic and serially uncorrelated (across $t$); see the appendix to this chapter.

There is one subtle point in determining the degrees of freedom for the fixed effects estimator. 
When we estimate the time-demeaned equation (14.5) by pooled OLS, we have NT total observa- 
tions and k independent variables. [Notice that there is no intercept in (14.5); it is eliminated by the 
fixed effects transformation.] Therefore, we should apparently have NT — k degrees of freedom. This 
calculation is incorrect. For each cross-sectional observation $i$, we lose one df because of the
time-demeaning. In other words, for each $i$, the demeaned errors $\ddot{u}_{it}$ add up to zero when summed across $t$,
so we lose one degree of freedom. (There is no such constraint on the original idiosyncratic errors $u_{it}$.)
Therefore, the appropriate degrees of freedom is df = NT — N — k = N(T — 1) — k. Fortunately, 
modern regression packages that have a fixed effects estimation feature properly compute the df. But 
if we have to do the time-demeaning and the estimation by pooled OLS ourselves, we need to correct 
the standard errors and test statistics. 
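The degrees-of-freedom correction is simple arithmetic, sketched below in Python. The function names are hypothetical. The correction factor follows from the fact that pooled OLS on the demeaned data divides the sum of squared residuals by $NT - k$, whereas the appropriate divisor is $N(T - 1) - k$. The illustrative call uses the dimensions of the scrap rate application in this section (N = 54, T = 3, k = 4).

```python
import math

def fe_degrees_of_freedom(n_units, n_periods, n_regressors):
    """Appropriate df for fixed effects: N(T - 1) - k."""
    return n_units * (n_periods - 1) - n_regressors

def se_correction_factor(n_units, n_periods, n_regressors):
    """Factor by which to multiply standard errors from a hand-run pooled OLS
    on time-demeaned data, which would otherwise use NT - k degrees of freedom."""
    naive_df = n_units * n_periods - n_regressors
    correct_df = fe_degrees_of_freedom(n_units, n_periods, n_regressors)
    return math.sqrt(naive_df / correct_df)

df_scrap = fe_degrees_of_freedom(54, 3, 4)       # N(T - 1) - k for the scrap rate panel
factor_scrap = se_correction_factor(54, 3, 4)    # greater than one: naive SEs are too small
```

Because the factor exceeds one, standard errors taken directly from a pooled OLS regression on hand-demeaned data are too small, and the corresponding t statistics are too large in absolute value.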


Effect of Job Training on Firm Scrap Rates 


We use the data for three years, 1987, 1988, and 1989, on the 54 firms that reported scrap rates in 
each year. No firms received grants prior to 1988; in 1988, 19 firms received grants; in 1989, 10 dif- 
ferent firms received grants. Therefore, we must also allow for the possibility that the additional job 
training in 1988 made workers more productive in 1989. This is easily done by including a lagged 
value of the grant indicator. We also include year dummies for 1988 and 1989. The results are given 
in Table 14.1. 

We have reported the results in a way that emphasizes the need to interpret the estimates in 
light of the unobserved effects model, (14.4). We are explicitly controlling for the unobserved,
time-constant effects in $a_i$. The time-demeaning allows us to estimate the $\beta_j$, but (14.5) is not the best
equation for interpreting the estimates.

TABLE 14.1 Fixed Effects Estimation of the Scrap Rate Equation

Dependent Variable: log(scrap)

Independent Variables    Coefficient (Standard Error)
d88                      −.080 (.109)
d89                      −.247 (.133)
grant                    −.252 (.151)
grant_{-1}               −.422 (.210)

Observations             162
Degrees of freedom       104
R-squared                .201

Interestingly, the estimated lagged effect of the training grant is substantially larger than the
contemporaneous effect: job training has an effect at least one year later. Because the dependent variable
is in logarithmic form, obtaining a grant in 1988 is predicted to lower the firm scrap rate in 1989 by
about 34.4% [$\exp(-.422) - 1 \approx -.344$]; the coefficient on $grant_{-1}$ is significant at the 5% level
against a two-sided alternative. The coefficient on grant is significant at the 10% level, and the size

CHAPTER 14 Advanced Panel Data Methods 465 


of the coefficient is hardly trivial. Notice the df is obtained as $N(T - 1) - k = 54(3 - 1) - 4 = 104$.

The coefficient on d89 indicates that the scrap rate was substantially lower in 1989 than in the base
year, 1987, even in the absence of job training grants. Thus, it is important to allow for these aggregate
effects. If we omitted the year dummies, the secular increase in worker productivity would be attributed
to the job training grants. Table 14.1 shows that, even after controlling for aggregate trends in
productivity, the job training grants had a large estimated effect.

Finally, it is crucial to allow for the lagged effect in the model. If we omit $grant_{-1}$, then we are
assuming that the effect of job training does not last into the next year. The estimate on grant when
we drop $grant_{-1}$ is −.082 (t = −.65); this is much smaller and statistically insignificant.

GOING FURTHER 14.2
Under the Michigan program, if a firm received a grant in one year, it was not eligible for a grant the
following year. What does this imply about the correlation between grant and $grant_{-1}$?
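The percentage effect quoted in this example comes from converting a coefficient in a log(scrap) equation into an exact percentage change. The short Python sketch below reproduces that arithmetic and the t statistic for the lagged grant variable, using only the coefficient and standard error reported in Table 14.1. The function name is hypothetical.

```python
import math

def exact_percent_change(log_coef):
    """Exact percentage change in y implied by a coefficient on a dummy
    variable when the dependent variable is log(y)."""
    return 100.0 * (math.exp(log_coef) - 1.0)

# Lagged grant coefficient and standard error from Table 14.1
grant_lag_effect = exact_percent_change(-0.422)   # about -34.4 percent
t_grant_lag = -0.422 / 0.210                      # about -2.01
```

The approximation $100 \cdot \hat{\beta}$ would give −42.2%, which overstates the decline; for coefficients this large in magnitude the exact calculation is the one to report.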


When estimating an unobserved effects model by fixed effects, it is not clear how we should 
compute a goodness-of-fit measure. The R-squared given in Table 14.1 is based on the within trans- 
formation: it is the R-squared obtained from estimating (14.5). Thus, it is interpreted as the amount of 
time variation in the $y_{it}$ that is explained by the time variation in the explanatory variables. Other ways
of computing R-squared are possible, one of which we discuss later. 

Although time-constant variables cannot be included by themselves in a fixed effects model, 
they can be interacted with variables that change over time and, in particular, with year dummy vari- 
ables. For example, in a wage equation where education is constant over time for each individual 
in our sample, we can interact education with each year dummy to see how the return to education 
has changed over time. But we cannot use fixed effects to estimate the return to education in the 
base period, which means we cannot estimate the return to education in any period; we can only see 
how the return to education in each year differs from that in the base period. Section 14-3 describes 
an approach that allows coefficients on time-constant variables to be estimated while preserving the 
fixed effects nature of the analysis. 

When we include a full set of year dummies—that is, year dummies for all years but the first— 
we cannot estimate the effect of any variable whose change across time is constant. An example is 
years of experience in a panel data set where each person works in every year, so that experience 
always increases by one in each year, for every person in the sample. The presence of $a_i$ accounts for
differences across people in their years of experience in the initial time period. But then the effect of 
a one-year increase in experience cannot be distinguished from the aggregate time effects (because 
experience increases by the same amount for everyone). This would also be true if, in place of sepa- 
rate year dummies, we used a linear time trend: for each person, experience cannot be distinguished 
from a linear trend. 
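The collinearity between experience and a time trend is easy to verify numerically. In the Python sketch below, the starting experience levels are hypothetical. After time-demeaning, experience is identical for every person and coincides with the demeaned trend, so the two cannot both be included.

```python
def time_demean(values):
    """Subtract the time average from each element of one unit's series."""
    avg = sum(values) / len(values)
    return [v - avg for v in values]

T = 4
start_exper = [0, 5, 12]   # hypothetical initial experience for three workers

# Each worker works every year, so experience rises by one per period
exper = [[e0 + t for t in range(T)] for e0 in start_exper]
trend = [list(range(1, T + 1)) for _ in start_exper]

exper_dm = [time_demean(e) for e in exper]
trend_dm = [time_demean(tr) for tr in trend]
# exper_dm and trend_dm are the same: differences in initial experience are
# absorbed by the time-demeaning, just as they are absorbed by a_i
```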


Has the Return to Education Changed over Time? 


The data in WAGEPAN are from Vella and Verbeek (1998). Each of the 545 men in the sample worked 
in every year from 1980 through 1987. Some variables in the data set change over time: experience, 
marital status, and union status are the three important ones. Other variables do not change: race and 
education are the key examples. If we use fixed effects (or first differencing), we cannot include race, 
education, or experience in the equation. However, we can include interactions of educ with year 
dummies for 1981 through 1987 to test whether the return to education was constant over this time 
period. We use log(wage) as the dependent variable, dummy variables for marital and union status, a
full set of year dummies, and the interaction terms $d81 \cdot educ$, $d82 \cdot educ$, ..., $d87 \cdot educ$.

The estimates on these interaction terms are all positive, and they generally get larger for
more recent years. The largest coefficient of .030 is on $d87 \cdot educ$, with t = 2.48. In other words,
the return to education is estimated to be about 3 percentage points larger in 1987 than in the base
year, 1980. (We do not have an estimate of the return to education in the base year for the reasons
given earlier.) The other significant interaction term is $d86 \cdot educ$ (coefficient = .027, t = 2.23). The
estimates on the earlier years are smaller and insignificant at the 5% level against a two-sided
alternative. If we do a joint F test for significance of all seven interaction terms, we get p-value = .28:
this gives an example where a set of variables is jointly insignificant even though some variables
are individually significant. [The df for the F test are 7 and 3,799; the second of these comes from
$N(T - 1) - k = 545(8 - 1) - 16 = 3{,}799$.] Generally, the results are consistent with an increase
in the return to education over this period.


14-1a The Dummy Variable Regression 


A traditional view of the fixed effects approach is to assume that the unobserved effect, $a_i$, is a
parameter to be estimated for each $i$. Thus, in equation (14.4), $a_i$ is the intercept for person $i$ (or firm $i$, city $i$,
and so on) that is to be estimated along with the $\beta_j$. (Clearly, we cannot do this with a single cross
section: there would be N + k parameters to estimate with only N observations. We need at least two 
time periods.) The way we estimate an intercept for each i is to put in a dummy variable for each 
cross-sectional observation, along with the explanatory variables (and probably dummy variables for 
each time period). This method is usually called the dummy variable regression. Even when N is 
not very large (say, N = 54 as in Example 14.1), this results in many explanatory variables—in most 
cases, too many to explicitly carry out the regression. Thus, the dummy variable method is not very 
practical for panel data sets with many cross-sectional observations. 

Nevertheless, the dummy variable regression has some interesting features. Most importantly, 
it gives us exactly the same estimates of the $\beta_j$ that we would obtain from the regression on
time-demeaned data, and the standard errors and other major statistics are identical. Therefore, the fixed
effects estimator can be obtained by the dummy variable regression. One benefit of the dummy vari- 
able regression is that it properly computes the degrees of freedom directly. This is a minor advantage 
now that many econometrics packages have programmed fixed effects options. 
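The equivalence between the dummy variable regression and the regression on time-demeaned data can be checked on a small example. The Python sketch below is illustrative only: the data are hypothetical, and the helper functions `solve` and `ols` are written here from scratch (normal equations solved by Gaussian elimination) so that the example is self-contained.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    sol = [0.0] * n
    for r in range(n - 1, -1, -1):
        sol[r] = (m[r][n] - sum(m[r][c] * sol[c] for c in range(r + 1, n))) / m[r][r]
    return sol

def ols(rows, y):
    """OLS coefficients from the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)

# Hypothetical balanced panel: N = 3 units, T = 3 periods, one regressor
x = [[1.0, 2.0, 4.0], [5.0, 7.0, 6.0], [10.0, 11.0, 13.0]]
y = [[2.5, 4.1, 8.3], [19.0, 24.6, 22.1], [70.2, 71.9, 76.4]]
n_units = len(x)

# Dummy variable regression: x plus one dummy per unit, no overall intercept
dv_rows, dv_y = [], []
for i in range(n_units):
    for t in range(len(x[i])):
        dummies = [1.0 if j == i else 0.0 for j in range(n_units)]
        dv_rows.append([x[i][t]] + dummies)
        dv_y.append(y[i][t])
dv_coefs = ols(dv_rows, dv_y)
beta_dv, a_hat = dv_coefs[0], dv_coefs[1:]

# Within regression: pooled OLS on time-demeaned data
num = den = 0.0
for xi, yi in zip(x, y):
    xbar, ybar = sum(xi) / len(xi), sum(yi) / len(yi)
    num += sum((p - xbar) * (q - ybar) for p, q in zip(xi, yi))
    den += sum((p - xbar) ** 2 for p in xi)
beta_within = num / den
```

The two slope estimates agree up to rounding error, and the coefficients on the unit dummies in `a_hat` are the estimated unit-specific intercepts.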

The R-squared from the dummy variable regression is usually rather high. This occurs because
we are including a dummy variable for each cross-sectional unit, which explains much of the variation
in the data. For example, if we estimate the unobserved effects model in Example 13.8 by fixed
effects using the dummy variable regression (which is possible with N = 22), then $R^2 = .933$. We
should not get too excited about this large R-squared: it is not surprising that we can explain much of
the variation in unemployment claims using both year and city dummies. Just as in Example 13.8, the
estimate on the EZ dummy variable is more important than $R^2$.

The R-squared from the dummy variable regression can be used to compute F tests in the 
usual way, assuming, of course, that the classical linear model assumptions hold (see the chapter 
appendix). In particular, we can test the joint significance of all of the cross-sectional dummies 
(N — 1, since one unit is chosen as the base group). The unrestricted R-squared is obtained from 
the regression with all of the cross-sectional dummies; the restricted R-squared omits these. In the 
vast majority of applications, the dummy variables will be jointly significant. It is important to
understand that a finding that the $\hat{a}_i$ are statistically different from one another is not the same as
concluding that pooled OLS (that is, not allowing for unit-specific fixed effects) is inconsistent.
The presence of heterogeneity does not mean that the heterogeneity is correlated with the explanatory
variables, $x_{itj}$. In fact, suppose that we have only one explanatory variable, $x_{it}$, and this variable
has been randomly assigned across $i$ and $t$. Then $x_{it}$ must be independent of $a_i$ and the $u_{is}$ for
$s = 1, 2, \ldots, T$. Yet it is still possible there is substantial variation in the $a_i$ in the population, and this
will be, with high probability, detected by the joint F test for the $\hat{a}_i$.
Occasionally, the estimated intercepts, say $\hat{a}_i$, are of interest. This is the case if we want to study
the distribution of the $\hat{a}_i$ across $i$, or if we want to pick a particular firm or city to see whether its $\hat{a}_i$
is above or below the average value in the sample. These estimates are directly available from the
dummy variable regression, but they are rarely reported by packages that have fixed effects routines
(for the practical reason that there are so many $\hat{a}_i$). After fixed effects estimation with N of any size,
the $\hat{a}_i$ are pretty easy to compute:


$$\hat{a}_i = \bar{y}_i - \hat{\beta}_1 \bar{x}_{i1} - \cdots - \hat{\beta}_k \bar{x}_{ik}, \quad i = 1, \ldots, N, \qquad [14.6]$$


where the overbar refers to the time averages and the $\hat{\beta}_j$ are the fixed effects estimates. For example, if
we have estimated a model of crime while controlling for various time-varying factors, we can obtain
$\hat{a}_i$ for a city to see whether the unobserved fixed effects that contribute to crime are above or below
average.
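Equation (14.6) translates directly into code. The Python sketch below assumes the fixed effects estimates are already available; the function name `unit_intercepts` and the data are hypothetical. The data are generated without noise from $y_{it} = 2 x_{it1} - x_{it2} + a_i$, so the computed intercepts can be compared with the known $a_i$.

```python
def unit_intercepts(y, x, beta_hat):
    """Equation (14.6): a_hat_i = ybar_i - sum_j beta_hat_j * xbar_ij.

    y[i][t] is the outcome and x[i][t] is a list of k regressors for unit i
    in period t; beta_hat is the list of k fixed effects estimates.
    """
    k = len(beta_hat)
    intercepts = []
    for yi, xi in zip(y, x):
        periods = len(yi)
        ybar = sum(yi) / periods
        xbar = [sum(row[j] for row in xi) / periods for j in range(k)]
        intercepts.append(ybar - sum(b * xb for b, xb in zip(beta_hat, xbar)))
    return intercepts

# Hypothetical noiseless panel with known unit effects a = [4, -2]
x = [[[1.0, 0.0], [2.0, 1.0]],
     [[3.0, 1.0], [5.0, 2.0]]]
y = [[6.0, 7.0],
     [3.0, 6.0]]
a_hat = unit_intercepts(y, x, beta_hat=[2.0, -1.0])

# The average of the unit-specific intercepts across i
average_intercept = sum(a_hat) / len(a_hat)
```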

Some econometrics packages that support fixed effects estimation report an “intercept,” which 
can cause confusion in light of our earlier claim that the time-demeaning eliminates all time-constant 
variables, including an overall intercept. [See equation (14.5).] Reporting an overall intercept in fixed 
effects (FE) estimation arises from viewing the $a_i$ as parameters to estimate. Typically, the intercept
reported is the average across $i$ of the $\hat{a}_i$. In other words, the overall intercept is actually the average of
the individual-specific intercepts, which is an unbiased, consistent estimator of $\alpha = E(a_i)$.

In most studies, the $\hat{\beta}_j$ are of interest, and so the time-demeaned equations are used to obtain
these estimates. Further, it is usually best to view the $a_i$ as omitted variables that we control for
through the within transformation. The sense in which the $a_i$ can be estimated is generally weak.
In fact, even though $\hat{a}_i$ is unbiased (under Assumptions FE.1 through FE.4 in the chapter appendix),
it is not consistent with a fixed $T$ as $N \to \infty$. The reason is that, as we add each additional
cross-sectional observation, we add a new $a_i$. No information accumulates on each $a_i$ when $T$ is fixed. With
larger $T$, we can get better estimates of the $a_i$, but most panel data sets are of the large $N$ and small
$T$ variety.


14-1b Fixed Effects or First Differencing? 


So far, setting aside pooled OLS, we have seen two competing methods for estimating unobserved 
effects models. One involves differencing the data, and the other involves time-demeaning. How do 
we know which one to use? 

We can eliminate one case immediately: when T = 2, the FE and FD estimates, as well as all test 
statistics, are identical, and so it does not matter which we use. Of course, the equivalence between 
the FE and FD estimates requires that we estimate the same model in each case. In particular, as we 
discussed in Chapter 13, it is natural to include an intercept in the FD equation; this intercept is actu- 
ally the intercept for the second time period in the original model written for the two time periods. 
Therefore, FE estimation must include a dummy variable for the second time period in order to be 
identical to the FD estimates that include an intercept. 
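The equivalence for T = 2 can be confirmed numerically. The Python sketch below uses hypothetical data and a small helper, `ols_two_regressors`, written here for illustration. It compares first differencing with an intercept against fixed effects with a second-period dummy.

```python
def ols_two_regressors(z1, z2, y):
    """OLS of y on z1 and z2 without an intercept (2x2 normal equations)."""
    s11 = sum(a * a for a in z1)
    s12 = sum(a * b for a, b in zip(z1, z2))
    s22 = sum(b * b for b in z2)
    s1y = sum(a * c for a, c in zip(z1, y))
    s2y = sum(b * c for b, c in zip(z2, y))
    det = s11 * s22 - s12 * s12
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det

# Hypothetical two-period panel: x[i] = [x_i1, x_i2], y[i] = [y_i1, y_i2]
x = [[1.0, 3.0], [2.0, 2.5], [4.0, 7.0], [3.0, 3.5]]
y = [[2.0, 7.0], [3.0, 5.0], [9.0, 16.0], [5.0, 8.0]]

# First differencing: regress the change in y on an intercept and the change in x
dx = [xi[1] - xi[0] for xi in x]
dy = [yi[1] - yi[0] for yi in y]
fd_intercept, fd_slope = ols_two_regressors([1.0] * len(dx), dx, dy)

# Fixed effects: time-demean y, x, and the second-period dummy, then pooled OLS
d2_dm, x_dm, y_dm = [], [], []
for xi, yi in zip(x, y):
    xbar, ybar = sum(xi) / 2, sum(yi) / 2
    for t in range(2):
        d2_dm.append(t - 0.5)          # dummy is 0 in period 1 and 1 in period 2
        x_dm.append(xi[t] - xbar)
        y_dm.append(yi[t] - ybar)
fe_d2, fe_slope = ols_two_regressors(d2_dm, x_dm, y_dm)
```

The slope estimates coincide, and the FD intercept equals the FE coefficient on the second-period dummy, which is the sense in which the two procedures estimate the same model.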

With T = 2, FD has the advantage of being straightforward to implement in any econometrics or 
statistical package that supports basic data manipulation, and it is easy to compute heteroskedasticity- 
robust statistics after FD estimation (because when T = 2, FD estimation is just a cross-sectional 
regression). 

When $T \geq 3$, the FE and FD estimators are not the same. Because both are unbiased under
Assumptions FE.1 through FE.4, we cannot use unbiasedness as a criterion. Further, both are
consistent (with $T$ fixed as $N \to \infty$) under FE.1 through FE.4. For large $N$ and small $T$, the choice between
FE and FD hinges on the relative efficiency of the estimators, and this is determined by the serial
correlation in the idiosyncratic errors, $u_{it}$. (We will assume homoskedasticity of the $u_{it}$, as efficiency
comparisons require homoskedastic errors.)
When the $u_{it}$ are serially uncorrelated, fixed effects is more efficient than first differencing (and
the standard errors reported from fixed effects are valid). Because the unobserved effects model is
typically stated (sometimes only implicitly) with serially uncorrelated idiosyncratic errors, the FE
estimator is used more than the FD estimator. But we should remember that this assumption can be
false. In many applications, we can expect the unobserved factors that change over time to be serially
correlated. If $u_{it}$ follows a random walk, which means that there is very substantial, positive serial
correlation, then the difference $\Delta u_{it}$ is serially uncorrelated, and first differencing is better. In many
cases, the $u_{it}$ exhibit some positive serial correlation, but perhaps not as much as a random walk.
Then, we cannot easily compare the efficiency of the FE and FD estimators.

It is difficult to test whether the $u_{it}$ are serially uncorrelated after FE estimation: we can estimate 
the time-demeaned errors, $\ddot{u}_{it}$, but not the $u_{it}$. However, in Section 13-3, we showed how to test 
whether the differenced errors, $\Delta u_{it}$, are serially uncorrelated. If this seems to be the case, FD can be 
used. If there is substantial negative serial correlation in the $\Delta u_{it}$, FE is probably better. It is often a 
good idea to try both: if the results are not sensitive, so much the better. 

When T is large, and especially when N is not very large (for example, N = 20 and T = 30), 
we must exercise caution in using the fixed effects estimator. Although exact distributional results 
hold for any N and T under the classical fixed effects assumptions, inference can be very sensitive 
to violations of the assumptions when N is small and T is large. In particular, if we are using unit 
root processes—see Chapter 11—the spurious regression problem can arise. First differencing has the 
advantage of turning an integrated time series process into a weakly dependent process. Therefore, if 
we apply first differencing, we can appeal to the central limit theorem even in cases where T is larger 
than N. Normality in the idiosyncratic errors is not needed, and heteroskedasticity and serial correla- 
tion can be dealt with as we touched on in Chapter 13. Inference with the fixed effects estimator is 
potentially more sensitive to nonnormality, heteroskedasticity, and serial correlation in the idiosyn- 
cratic errors. 

Like the first difference estimator, the fixed effects estimator can be very sensitive to classical 
measurement error in one or more explanatory variables. However, if each $x_{itj}$ is uncorrelated with $u_{it}$, 
but the strict exogeneity assumption is otherwise violated—for example, a lagged dependent variable 
is included among the regressors or there is feedback between $u_{it}$ and future outcomes of the 
explanatory variable—then the FE estimator likely has substantially less bias than the FD estimator 
(unless T = 2). The important theoretical fact is that the bias in the FD estimator does not depend on 
T, while that for the FE estimator tends to zero at the rate 1/T. See Wooldridge (2010, Section 10-7) 
for details. 

Generally, it is difficult to choose between FE and FD when they give substantively different 
results. It makes sense to report both sets of results and to try to determine why they differ. 


14-1c Fixed Effects with Unbalanced Panels 


Some panel data sets, especially on individuals or firms, have missing years for at least some cross- 
sectional units in the sample. In this case, we call the data set an unbalanced panel. The mechanics 
of fixed effects estimation with an unbalanced panel are not much more difficult than with a balanced 
panel. If $T_i$ is the number of time periods for cross-sectional unit i, we simply use these $T_i$ observations 
in doing the time-demeaning. The total number of observations is then $T_1 + T_2 + \cdots + T_N$. As 
in the balanced case, one degree of freedom is lost for every cross-sectional observation due to the 
time-demeaning. Any regression package that does fixed effects makes the appropriate adjustment for 
this loss. The dummy variable regression also goes through in exactly the same way as with a balanced 
panel, and the df is appropriately obtained. 

It is easy to see that units for which we have only a single time period play no role in a fixed 
effects analysis. The time-demeaning for such observations yields all zeros, which are not used in the 
estimation. (If $T_i$ is at most two for all i, we can use first differencing: if $T_i = 1$ for any i, we do not 
have two periods to difference.) 

Dropping units with only one time period does not cause bias or inconsistency, contrary to a criticism 
that some make against fixed effects estimation. The $T_i = 1$ units contribute nothing to learning 
about the $\beta_j$ in an FE environment, and so these units properly drop out. Often, econometrics packages 
report how many cross-sectional units are lost. 
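
The mechanics of time-demeaning on an unbalanced panel can be sketched as follows. The tiny data set and the column names (`unit`, `year`, `y`, `x`) are hypothetical and serve only to illustrate the points above: each unit is demeaned using its own observations, a single-period unit is demeaned to zero, and one degree of freedom is lost per cross-sectional unit.

```python
import numpy as np
import pandas as pd

# Hypothetical unbalanced panel: unit 3 is observed in a single period
df = pd.DataFrame({
    "unit": [1, 1, 1, 2, 2, 3],
    "year": [1, 2, 3, 1, 3, 2],
    "y":    [2.0, 3.0, 4.5, 1.0, 2.0, 5.0],
    "x":    [1.0, 2.0, 3.0, 0.5, 1.5, 4.0],
})

# Time-demean within each unit, using only that unit's observations
unit_means = df.groupby("unit")[["y", "x"]].transform("mean")
df["y_dd"] = df["y"] - unit_means["y"]
df["x_dd"] = df["x"] - unit_means["x"]

# FE slope: pooled OLS on the demeaned data (no intercept needed)
beta_fe = (df["x_dd"] * df["y_dd"]).sum() / (df["x_dd"] ** 2).sum()

# Dropping single-period units leaves the estimate unchanged
n_periods = df.groupby("unit")["x"].transform("count")
multi = df[n_periods > 1]
beta_fe_multi = (multi["x_dd"] * multi["y_dd"]).sum() / (multi["x_dd"] ** 2).sum()

# Degrees of freedom: total observations minus one per unit minus k slopes
n_obs = len(df)
n_units = df["unit"].nunique()
k = 1
df_resid = n_obs - n_units - k
```

Because the single-period unit adds one observation and uses up one degree of freedom, the residual degrees of freedom are the same whether or not it is kept in the sample.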

The more difficult issue with an unbalanced panel is determining why the panel is unbalanced. With 
cities and states, for example, data on key variables are sometimes missing for certain years. Provided 
the reason we have missing data for some i is not correlated with the idiosyncratic errors, $u_{it}$, the unbalanced 
panel causes no problems. When we have data on individuals, families, or firms, things are trickier. 
Imagine, for example, that we obtain a random sample of manufacturing firms in 1990, and we are inter- 
ested in testing how unionization affects firm profitability. Ideally, we can use a panel data analysis to 
control for unobserved worker and management characteristics that affect profitability and might also be 
correlated with the fraction of the firm’s workforce that is unionized. If we collect data again in subsequent 
years, some firms may be lost because they have gone out of business or have merged with other com- 
panies. If so, we probably have a nonrandom sample in subsequent time periods. The question is: If we 
apply fixed effects to the unbalanced panel, when will the estimators be unbiased (or at least consistent)? 

If the reason a firm leaves the sample (called attrition) is correlated with the idiosyncratic error— 
those unobserved factors that change over time and affect profits—then the resulting sample selection 
problem (see Chapter 9) can cause biased estimators. This is a serious consideration in this example. 
Nevertheless, one useful thing about a fixed effects analysis is that it does allow attrition to be correlated 
with $a_i$, the unobserved effect. The idea is that, with the initial sampling, some units are more 
likely to drop out of the survey, and this is captured by $a_i$. 

For general missing data patterns, FE has an advantage over FD. Namely, FE maximizes the 
number of observations used in estimation, whereas FD only uses a time period t if time periods t 
and t − 1 both include a full set of observations. Consider an extreme case with a single explanatory 
variable, $x_{it}$, with T = 7 the maximum number of time periods that can be observed. If for unit i data 
on $x_{it}$ are missing for all even values of t, FD will use no time periods for this unit: the differences are 
missing for all t = 2, 3, ..., T. On the other hand, FE will use time periods 1, 3, 5, and 7. If the missing 
data problem is pure attrition, FD and FE both use the maximum number of observations: if all 
variables are observed at time t, they are also observed at t − 1. 
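
The extreme case above is easy to reproduce by counting usable periods. This is a small sketch for one hypothetical unit; the attrition pattern in the second half (the unit leaves after period 4) is an assumption added for contrast.

```python
import numpy as np

T = 7
t = np.arange(1, T + 1)
observed = (t % 2 == 1)              # x observed only in periods 1, 3, 5, 7

# FE can use every period in which the data are observed
fe_periods = t[observed]

# FD can form the difference at t only if both t and t - 1 are observed
fd_usable = observed[1:] & observed[:-1]
fd_periods = t[1:][fd_usable]

# Pure attrition (assumed here: the unit leaves after period 4)
attrit = t <= 4
fe_attr = t[attrit]                               # periods 1, 2, 3, 4
fd_attr = t[1:][attrit[1:] & attrit[:-1]]         # differences at t = 2, 3, 4
```

With alternating gaps FD loses the unit entirely while FE keeps four observations; with pure attrition every observed period after the first yields a usable difference.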


Effect of Job Training on Firm Scrap Rates 


We add two variables to the analysis in Table 14.1: $\log(sales_{it})$ and $\log(employ_{it})$, where sales is 
annual firm sales and employ is number of employees. Three of the 54 firms drop out of the analysis 
entirely because they do not have sales or employment data. Five additional observations are lost due 
to missing data on one or both of these variables for some years, leaving us with n = 148. Using fixed 
effects on the unbalanced panel does not change the basic story, although the estimated grant effect 
gets larger: $\hat{\beta}_{grant} = -.297$, $t_{grant} = -1.89$; $\hat{\beta}_{grant_{-1}} = -.536$, $t_{grant_{-1}} = -2.389$. 


Solving general missing data problems when selection can be correlated with the idiosyncratic errors 
is complicated and beyond the scope of this text. [See, for example, Wooldridge (2010, Chapter 19).] 


14-2 Random Effects Models 


We begin with the same unobserved effects model as before, 


$y_{it} = \beta_0 + \beta_1 x_{it1} + \cdots + \beta_k x_{itk} + a_i + u_{it}$, [14.7] 


where we explicitly include an intercept so that we can make the assumption that the unobserved 
effect, $a_i$, has zero mean (without loss of generality). We would usually allow for time dummies 
among the explanatory variables as well. In using fixed effects or first differencing, the goal is to 
eliminate $a_i$ because it is thought to be correlated with one or more of the $x_{itj}$. But suppose we think 
$a_i$ is uncorrelated with each explanatory variable in all time periods. Then, using a transformation to 
eliminate $a_i$ results in inefficient estimators. 

Equation (14.7) becomes a random effects model when we assume that the unobserved effect $a_i$ 
is uncorrelated with each explanatory variable: 


$\mathrm{Cov}(x_{itj}, a_i) = 0, \quad t = 1, 2, \ldots, T; \; j = 1, 2, \ldots, k.$ [14.8] 


In fact, the ideal random effects assumptions include all of the fixed effects assumptions plus the 
additional requirement that $a_i$ is independent of all explanatory variables in all time periods. (See the 
chapter appendix for the actual assumptions used.) If we think the unobserved effect $a_i$ is correlated 
with any explanatory variables, we should use first differencing or fixed effects. 

Under (14.8) and along with the random effects assumptions, how should we estimate the $\beta_j$? It 
is important to see that, if we believe that $a_i$ is uncorrelated with the explanatory variables, the $\beta_j$ can 
be consistently estimated by using a single cross section: there is no need for panel data at all. But 
using a single cross section disregards much useful information in the other time periods. We can also 
use the data in a pooled OLS procedure: just run OLS of $y_{it}$ on the explanatory variables and probably 
the time dummies. This, too, produces consistent estimators of the $\beta_j$ under the random effects 
assumption. But it ignores a key feature of the model. If we define the composite error term as 
$v_{it} = a_i + u_{it}$, then (14.7) can be written as 


$y_{it} = \beta_0 + \beta_1 x_{it1} + \cdots + \beta_k x_{itk} + v_{it}$. [14.9] 


Because $a_i$ is in the composite error in each time period, the $v_{it}$ are serially correlated across time. In 
fact, under the random effects assumptions, 


$\mathrm{Corr}(v_{it}, v_{is}) = \sigma_a^2/(\sigma_a^2 + \sigma_u^2), \quad t \neq s,$ 


where $\sigma_a^2 = \mathrm{Var}(a_i)$ and $\sigma_u^2 = \mathrm{Var}(u_{it})$. This (necessarily) positive serial correlation in the error term 
can be substantial, and, because the usual pooled OLS standard errors ignore this correlation, they 
will be incorrect, as will the usual test statistics. In Chapter 12, we showed how generalized least 
squares can be used to estimate models with autoregressive serial correlation. We can also use GLS to 
solve the serial correlation problem here. For the procedure to have good properties, we should have 
large N and relatively small T. We assume that we have a balanced panel, although the method can be 
extended to unbalanced panels. 
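
The correlation formula for the composite error can be obtained in three short steps. The derivation below is a sketch that assumes, as the random effects assumptions do, that the idiosyncratic errors are serially uncorrelated with constant variance $\sigma_u^2$ and are uncorrelated with the unobserved effect, whose variance is $\sigma_a^2$.

```latex
\begin{aligned}
\operatorname{Cov}(v_{it}, v_{is})
  &= \operatorname{Cov}(a_i + u_{it},\; a_i + u_{is})
   = \operatorname{Var}(a_i) = \sigma_a^2, \qquad t \neq s, \\
\operatorname{Var}(v_{it})
  &= \operatorname{Var}(a_i) + \operatorname{Var}(u_{it})
   = \sigma_a^2 + \sigma_u^2, \\
\operatorname{Corr}(v_{it}, v_{is})
  &= \frac{\sigma_a^2}{\sigma_a^2 + \sigma_u^2}, \qquad t \neq s.
\end{aligned}
```

The correlation does not depend on how far apart the two periods are, which is why the remedy differs from the autoregressive case.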

Deriving the GLS transformation that eliminates serial correlation in the errors requires sophisti- 
cated matrix algebra [see, for example, Wooldridge (2010, Chapter 10)]. But the transformation itself 
is simple. Define 


$\theta = 1 - [\sigma_u^2/(\sigma_u^2 + T\sigma_a^2)]^{1/2}$, [14.10] 
which is between zero and one. Then, the transformed equation turns out to be 


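$y_{it} - \theta\bar{y}_i = \beta_0(1 - \theta) + \beta_1(x_{it1} - \theta\bar{x}_{i1}) + \cdots$ [14.11] 
$\qquad + \beta_k(x_{itk} - \theta\bar{x}_{ik}) + (v_{it} - \theta\bar{v}_i),$ 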


where the overbar again denotes the time averages. This is a very interesting equation, as it 
involves quasi-demeaned data on each variable. The fixed effects estimator subtracts the time 
averages from the corresponding variable. The random effects transformation subtracts a fraction 
of that time average, where the fraction depends on $\sigma_u^2$, $\sigma_a^2$, and the number of time periods, 
T. The GLS estimator is simply the pooled OLS estimator of equation (14.11). It is hardly 
obvious that the errors in (14.11) are serially uncorrelated, but they are. (See Problem 3 at the 
end of this chapter.) 
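
The claim that the quasi-demeaned errors are serially uncorrelated can be verified numerically for given variances. The sketch below is only an illustration under the random effects assumptions: the variance values and T are arbitrary, and the function names `theta` and `quasi_demean` are made up for this example.

```python
import numpy as np

def theta(sigma2_u, sigma2_a, T):
    """GLS weight defined in equation (14.10)."""
    return 1.0 - np.sqrt(sigma2_u / (sigma2_u + T * sigma2_a))

def quasi_demean(v, th):
    """Subtract the fraction th of each row's time average (rows are units)."""
    return v - th * v.mean(axis=1, keepdims=True)

sigma2_u, sigma2_a, T = 1.0, 2.0, 4        # arbitrary illustrative values
th = theta(sigma2_u, sigma2_a, T)

# Covariance matrix of (v_i1, ..., v_iT): sigma_u^2 on the diagonal from u_it,
# plus sigma_a^2 in every entry because a_i appears in every period
omega = sigma2_u * np.eye(T) + sigma2_a * np.ones((T, T))

# Quasi-demeaning written as a T x T linear map on one unit's error vector
M = np.eye(T) - th * np.ones((T, T)) / T
omega_star = M @ omega @ M.T               # covariance of v_it - theta * vbar_i
```

For these values the transformed covariance matrix is $\sigma_u^2$ times the identity, so the transformed errors are homoskedastic and serially uncorrelated.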

The transformation in (14.11) allows for explanatory variables that are constant over time, and 
this is one advantage of random effects (RE) over either fixed effects or first differencing. This is 
possible because RE assumes that the unobserved effect is uncorrelated with all explanatory vari- 
ables, whether the explanatory variables are fixed over time or not. Thus, in a wage equation, we can 
include a variable such as education even if it does not change over time. But we are assuming that 
education is uncorrelated with $a_i$, which contains ability and family background. In many applications, 
the whole reason for using panel data is to allow the unobserved effect to be correlated with the 
explanatory variables. 

The parameter $\theta$ is never known in practice, but it can always be estimated. There are different 
ways to do this, which may be based on pooled OLS or fixed effects, for example. Generally, $\hat{\theta}$ takes 
the form $\hat{\theta} = 1 - \{1/[1 + T(\hat{\sigma}_a^2/\hat{\sigma}_u^2)]\}^{1/2}$, where $\hat{\sigma}_a^2$ is a consistent estimator of $\sigma_a^2$ and $\hat{\sigma}_u^2$ is a 
consistent estimator of $\sigma_u^2$. These estimators can be based on the pooled OLS or fixed effects residuals. 
One possibility is that $\hat{\sigma}_a^2 = [NT(T-1)/2 - (k+1)]^{-1} \sum_{i=1}^{N} \sum_{t=1}^{T-1} \sum_{s=t+1}^{T} \hat{v}_{it}\hat{v}_{is}$, where the $\hat{v}_{it}$ 
are the residuals from estimating (14.9) by pooled OLS. Given this, we can estimate $\sigma_u^2$ by using 
$\hat{\sigma}_u^2 = \hat{\sigma}_v^2 - \hat{\sigma}_a^2$, where $\hat{\sigma}_v^2$ is the square of the usual standard error of the regression from pooled OLS. 
[See Wooldridge (2010, Chapter 10) for additional discussion of these estimators.] 
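
One way to turn pooled OLS residuals into an estimate of $\theta$ is sketched below. It assumes the variance of $a_i$ is estimated from the average cross-product of a unit's residuals in different periods, with a degrees-of-freedom correction of k + 1, and that the variance of $u_{it}$ is the pooled OLS error variance minus that estimate. The simulated data, the true variances (both set to one), and the function name `estimate_theta` are assumptions for illustration; packages may use other estimators.

```python
import numpy as np

def estimate_theta(resid, k):
    """resid: N x T pooled OLS residuals (balanced panel); k: number of slopes."""
    N, T = resid.shape
    # Sum over i of sum_{t<s} vhat_it * vhat_is, computed from the identity
    # (sum_t v_t)^2 = sum_t v_t^2 + 2 * sum_{t<s} v_t v_s
    row_sum = resid.sum(axis=1)
    cross = 0.5 * (row_sum ** 2 - (resid ** 2).sum(axis=1)).sum()
    sigma2_a = cross / (N * T * (T - 1) / 2 - (k + 1))
    sigma2_v = (resid ** 2).sum() / (N * T - (k + 1))   # squared SER, pooled OLS
    sigma2_u = sigma2_v - sigma2_a
    th = 1.0 - np.sqrt(1.0 / (1.0 + T * sigma2_a / sigma2_u))
    return th, sigma2_a, sigma2_u

rng = np.random.default_rng(1)
N, T, k = 2000, 5, 1
a = rng.normal(size=(N, 1))                 # Var(a_i) = 1 (assumed)
u = rng.normal(size=(N, T))                 # Var(u_it) = 1 (assumed)
x = rng.normal(size=(N, T))                 # uncorrelated with a_i
y = 1.0 + 0.5 * x + a + u

X = np.column_stack([np.ones(N * T), x.ravel()])
b_pols = np.linalg.lstsq(X, y.ravel(), rcond=None)[0]
resid = (y.ravel() - X @ b_pols).reshape(N, T)

theta_hat, s2a, s2u = estimate_theta(resid, k)
theta_true = 1.0 - np.sqrt(1.0 / (1.0 + T * 1.0 / 1.0))
```

In small samples the estimated variance of $a_i$ from a formula of this kind can turn out negative, in which case the square root is not usable and a different estimator is needed.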

Many econometrics packages support estimation of random effects models and automatically 
compute some version of $\hat{\theta}$. The feasible GLS estimator that uses $\hat{\theta}$ in place of $\theta$ is called the random 
effects estimator. Under the random effects assumptions in the chapter appendix, the estimator 
is consistent (not unbiased) and asymptotically normally distributed as N gets large with fixed T. 
The properties of the random effects (RE) estimator with small N and large T are largely unknown, 
although it has certainly been used in such situations. 

Equation (14.11) allows us to relate the RE estimator to both pooled OLS and fixed effects. 
Pooled OLS is obtained when $\theta = 0$, and FE is obtained when $\theta = 1$. In practice, the estimate 
$\hat{\theta}$ is never zero or one. But if $\hat{\theta}$ is close to zero, the RE estimates will be close to the pooled OLS 
estimates. This is the case when the unobserved effect, $a_i$, is relatively unimportant (because it has 
small variance relative to $\sigma_u^2$). It is more common for $\sigma_a^2$ to be large relative to $\sigma_u^2$, in which case $\hat{\theta}$ 
will be closer to unity. As T gets large, $\hat{\theta}$ tends to one, and this makes the RE and FE estimates very 
similar. 
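
The two limiting cases can be confirmed by running OLS on quasi-demeaned data for different values of $\theta$. The following sketch uses simulated data with one regressor that is deliberately correlated with the unobserved effect; the function name `re_slope` and all numerical settings are assumptions for illustration.

```python
import numpy as np

def re_slope(y, x, th):
    """Slope from OLS on quasi-demeaned data (balanced panel, N x T arrays)."""
    ys = (y - th * y.mean(axis=1, keepdims=True)).ravel()
    xs = (x - th * x.mean(axis=1, keepdims=True)).ravel()
    # The intercept column is quasi-demeaned too and equals 1 - th;
    # it vanishes when th = 1, so it is dropped in that case
    cols = [xs] if th == 1.0 else [np.full(xs.size, 1.0 - th), xs]
    Z = np.column_stack(cols)
    return np.linalg.lstsq(Z, ys, rcond=None)[0][-1]

rng = np.random.default_rng(2)
N, T = 300, 4
a = rng.normal(size=(N, 1))
x = rng.normal(size=(N, T)) + a             # x_it correlated with a_i
y = 1.0 + 2.0 * x + a + rng.normal(size=(N, T))

# Direct pooled OLS and fixed effects (within) slopes for comparison
X = np.column_stack([np.ones(N * T), x.ravel()])
b_pols = np.linalg.lstsq(X, y.ravel(), rcond=None)[0][1]
xd = x - x.mean(axis=1, keepdims=True)
yd = y - y.mean(axis=1, keepdims=True)
b_fe = (xd * yd).sum() / (xd ** 2).sum()

slopes = {th: re_slope(y, x, th) for th in (0.0, 0.5, 1.0)}
```

Setting the weight to zero reproduces the pooled OLS slope and setting it to one reproduces the FE slope; an intermediate weight gives a slope between the two.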

We can gain more insight on the relative merits of random effects versus fixed effects by writing 
the quasi-demeaned error in equation (14.11) as $v_{it} - \theta\bar{v}_i = (1 - \theta)a_i + u_{it} - \theta\bar{u}_i$. This simple 
expression makes it clear that in the transformed equation the unobserved effect is weighted by 
$(1 - \theta)$. Although correlation between $a_i$ and one or more $x_{itj}$ causes inconsistency in the random 
effects estimation, we see that the correlation is attenuated by the factor $(1 - \theta)$. As $\theta \to 1$, the bias 
term goes to zero, as it must because the RE estimator tends to the FE estimator. If $\theta$ is close to zero, 
we are leaving a larger fraction of the unobserved effect in the error term, and, as a consequence, the 
asymptotic bias of the RE estimator will be larger. 

In applications of FE and RE, it is usually informative also to compute the pooled OLS esti- 
mates. Comparing the three sets of estimates can help us determine the nature of the biases caused 
by leaving the unobserved effect, $a_i$, entirely in the error term (as does pooled OLS) or partially in 
the error term (as does the RE transformation). But we must remember that, even if $a_i$ is uncorrelated 
with all explanatory variables in all time periods, the pooled OLS standard errors and test statistics 
are generally invalid: they ignore the often substantial serial correlation in the composite errors, 
$v_{it} = a_i + u_{it}$. As we mentioned in Chapter 13 (see Example 13.9), it is possible to compute standard 
errors and test statistics that are robust to arbitrary serial correlation (and heteroskedasticity) in 
$v_{it}$, and popular statistics packages often allow this option. [See, for example, Wooldridge (2010, 
Chapter 10).] 


EXAMPLE 14.4 A Wage Equation Using Panel Data 


We again use the data in WAGEPAN to estimate a wage equation for men. We use three methods: 
pooled OLS, random effects, and fixed effects. In the first two methods, we can include educ and race 
dummies (black and hispan), but these drop out of the fixed effects analysis. The time-varying variables 
are exper, exper², union, and married. As we discussed in Section 14-1, exper is dropped in the 
FE analysis (although exper² remains). Each regression also contains a full set of year dummies. The 
estimation results are in Table 14.2. 


TABLE 14.2 Three Different Estimators of a Wage Equation 


Dependent Variable: log(wage) 

Independent Variables    Pooled OLS     Random Effects    Fixed Effects 

educ                     [illegible]    .092              -- 
                         (.005)         (.011) 

black                    -.139          -.139             -- 
                         (.024)         (.048) 

hispan                   .016           .022              -- 
                         (.021)         (.043) 

exper                    .067           .106              -- 
                         (.014)         (.015) 

exper²                   -.0024         -.0047            -.0052 
                         (.0008)        (.0007)           (.0007) 

married                  .108           .064              .047 
                         (.016)         (.017)            (.018) 

union                    .182           .106              .080 
                         (.017)         (.018)            (.019) 


The coefficients on educ, black, and hispan are similar for the pooled OLS and random effects 
estimations. The pooled OLS standard errors are the usual OLS standard errors, and these underestimate 
the true standard errors because they ignore the positive serial correlation; we report them here for 
comparison only. The experience profile is somewhat different, and both the marriage and union 
premiums fall notably in the random effects estimation. When we eliminate the unobserved effect 
entirely by using fixed effects, the marriage premium falls to about 4.7%, although it is still statistically 
significant. The drop in the marriage premium is consistent with the idea that men who are 
more able—as captured by a higher unobserved effect, $a_i$—are more likely to be married. Therefore, 
in the pooled OLS estimation, a large part of the marriage premium reflects the fact that men who 
are married would earn more even if they were not married. The remaining 4.7% has at least two 
possible explanations: (1) marriage really makes men more productive or (2) employers pay married 
men a premium because marriage is a signal of stability. We cannot distinguish between these 
two hypotheses. 

The estimate of $\theta$ for the random effects estimation is $\hat{\theta} = .643$, which helps explain why, on 
the time-varying variables, the RE estimates lie closer to the FE estimates than to the pooled OLS 
estimates. 


GOING FURTHER 14.3 

The union premium estimated by fixed effects is about 10 percentage points lower than the OLS 
estimate. What does this strongly suggest about the correlation between union and the unobserved effect? 


14-2a Random Effects or Pooled OLS? 


Sometimes one sees the choice between RE estimation and pooled OLS based on a test of $H_0\colon \sigma_a^2 = 0$. 
The idea is that $\sigma_a^2 = 0$ indicates no unobserved effect, in which case we should just use OLS. Breusch 
and Pagan (1980) developed such a test, but, from a modern perspective, its usefulness is somewhat 
limited. Most importantly, the outcome of the test has nothing to say about whether pooled OLS is 
consistent. The presence of $a_i$, indicated by $\sigma_a^2 > 0$, has nothing to do with whether $a_i$ is correlated 
with $x_{it}$. In fact, both RE and POLS are inconsistent if $a_i$ is correlated with the explanatory variables. 

A second shortcoming of the use of the test is that it is effectively a test of positive serial correlation 
in the composite error terms $v_{it} = a_i + u_{it}$. In the traditional setting where the $u_{it}$ are assumed to 
be serially uncorrelated, any correlation in $v_{it}$ and $v_{is}$ for $t \neq s$ is assumed to be due to the presence of 
$a_i$. But the $u_{it}$ could follow, say, an AR(1) process with positive correlation, and that would be picked 
up by the Breusch-Pagan (B-P) test. In any case, the test usually rejects the null hypothesis because in 
situations in which time is involved and we cannot include lagged dependent variables, there is usu- 
ally positive serial correlation in the errors. It is up to us to decide why that is the case. 

The other drawbacks to the Breusch-Pagan test are relatively minor but still against the spirit of 
robust inference. In particular, the test assumes that all unobservables are normally distributed, and it 
also maintains homoskedasticity. Neither of these assumptions is necessary for either RE or POLS to 
be consistent. 

Even though the outcome of the B-P test is rarely surprising or informative, there are good reasons 
to prefer RE over POLS. First, as equation (14.11) shows, RE removes the fraction $\theta$ of $a_i$ from the 
error term, and so it likely has less bias (inconsistency) than POLS. Second, even if the error structure 
is not as simple as the traditional RE assumptions, the RE estimator can be quite a bit more efficient 
than POLS: RE removes at least some, if not all, of the serial correlation. 


14-2b Random Effects or Fixed Effects? 


Because fixed effects allows arbitrary correlation between $a_i$ and the $x_{itj}$, while random effects does 
not, FE is widely thought to be a more convincing tool for estimating ceteris paribus effects. Still, 
random effects is applied in certain situations. Most obviously, if the key explanatory variable is con- 
stant over time, we cannot use FE to estimate its effect on y. For example, in Table 14.2, we must 
rely on the RE (or pooled OLS) estimate of the return to education. Of course, we can only use ran- 
dom effects because we are willing to assume the unobserved effect is uncorrelated with all explana- 
tory variables. Typically, if one uses random effects, as many time-constant controls as possible are 
included among the explanatory variables. (With an FE analysis, it is not necessary to include such 
controls.) RE is preferred to pooled OLS because RE is generally more efficient. 

If our interest is in a time-varying explanatory variable, is there ever a case to use RE rather than 
FE? Yes, but situations in which $\mathrm{Cov}(x_{itj}, a_i) = 0$ should be considered the exception rather than the 
rule. If the key policy variable is set experimentally—say, each year, children are randomly assigned 
to classes of different sizes—then random effects would be appropriate for estimating the effect of 
class size on performance. Unfortunately, in most cases the regressors are themselves outcomes of 
choice processes and likely to be correlated with individual preferences and abilities as captured by $a_i$. 

It is still fairly common to see researchers apply both random effects and fixed effects, and then 
formally test for statistically significant differences in the coefficients on the time-varying explanatory 
variables. (So, in Table 14.2, these would be the coefficients on exper², married, and union.) 
Hausman (1978) first proposed such a test, and some econometrics packages routinely compute the 
Hausman test under the full set of random effects assumptions listed in the appendix to this chapter. 
The idea is that one uses the random effects estimates unless the Hausman test rejects (14.8). In 
practice, a failure to reject means either that the RE and FE estimates are sufficiently close so that 
it does not matter which is used, or the sampling variation is so large in the FE estimates that one 
cannot conclude practically significant differences are statistically significant. In the latter case, one 
is left to wonder whether there is enough information in the data to provide precise estimates of the 
coefficients. A rejection using the Hausman test is taken to mean that the key RE assumption, (14.8), 
is false, and then the FE estimates are used. (Naturally, as in all applications of statistical inference, 
one should distinguish between a practically significant difference and a statistically significant dif- 
ference.) Wooldridge (2010, Chapter 10) contains further discussion. In the next section we discuss an 
alternative, computationally simpler approach to choosing between the RE and FE approaches. 

A final word of caution. In reading empirical work, you may find that some authors decide on 
FE versus RE estimation based on whether the $a_i$ are properly viewed as parameters to estimate or 
as random variables. Such considerations are usually wrongheaded. In this chapter, we have treated 
the $a_i$ as random variables in the unobserved effects model (14.7), regardless of how we decide to 
estimate the $\beta_j$. As we have emphasized, the key issue that determines whether we use FE or RE is 
whether we can plausibly assume $a_i$ is uncorrelated with all $x_{itj}$. Nevertheless, in some applications 
of panel data methods, we cannot treat our sample as a random sample from a large population, 
especially when the unit of observation is a large geographical unit (say, states or provinces). Then, it 
often makes sense to think of each $a_i$ as a separate intercept to estimate for each cross-sectional unit. 
In this case, we use fixed effects: remember, using FE is mechanically the same as allowing a different 
intercept for each cross-sectional unit. Fortunately, whether or not we engage in the philosophical 
debate about the nature of $a_i$, FE is almost always much more convincing than RE for policy analysis 
using aggregated data. 


14-3 The Correlated Random Effects Approach 


In applications where it makes sense to view the $a_i$ (unobserved effects) as being random variables, 
along with the observed variables we draw, there is an alternative to fixed effects that still allows $a_i$ 
to be correlated with the observed explanatory variables. To describe the approach, consider again 
the simple model in equation (14.1), with a single, time-varying explanatory variable $x_{it}$. Rather than 
assume $a_i$ is uncorrelated with $\{x_{it}: t = 1, 2, \ldots, T\}$—which is the random effects approach—or take 
away time averages to remove $a_i$—the fixed effects approach—we might instead model correlation 
between $a_i$ and $\{x_{it}: t = 1, 2, \ldots, T\}$. Because $a_i$ is, by definition, constant over time, allowing it to be 
correlated with the average level of the $x_{it}$ has a certain appeal. More specifically, let $\bar{x}_i = T^{-1}\sum_{t=1}^{T} x_{it}$ 
be the time average, as before. Suppose we assume the simple linear relationship 

$a_i = \alpha + \gamma\bar{x}_i + r_i$, [14.12] 

where we assume $r_i$ is uncorrelated with each $x_{it}$. Because $\bar{x}_i$ is a linear function of the $x_{it}$, 

$\mathrm{Cov}(\bar{x}_i, r_i) = 0$. [14.13] 


Equations (14.12) and (14.13) imply that $a_i$ and $\bar{x}_i$ are correlated whenever $\gamma \neq 0$. 
The correlated random effects (CRE) approach uses (14.12) in conjunction with (14.1): substi- 
tuting the former in the latter gives 


$y_{it} = \beta x_{it} + \alpha + \gamma\bar{x}_i + r_i + u_{it} = \alpha + \beta x_{it} + \gamma\bar{x}_i + r_i + u_{it}$. [14.14] 


Equation (14.14) is interesting because it still has a composite error term, $r_i + u_{it}$, consisting of a 
time-constant unobservable $r_i$ and the idiosyncratic shocks, $u_{it}$. Importantly, assumption (14.8) holds 
when we replace $a_i$ with $r_i$. Also, because $u_{it}$ is assumed to be uncorrelated with $x_{is}$, all s and t, $u_{it}$ is 
also uncorrelated with $\bar{x}_i$. All of these assumptions add up to random effects estimation of the equation 


$y_{it} = \alpha + \beta x_{it} + \gamma\bar{x}_i + r_i + u_{it}$, [14.15] 


which is like the usual equation underlying RE estimation with the important addition of the time-average 
variable, $\bar{x}_i$. It is the addition of $\bar{x}_i$ that controls for the correlation between $a_i$ and the sequence 
$\{x_{it}: t = 1, 2, \ldots, T\}$. What is left over, $r_i$, is uncorrelated with the $x_{it}$. 




In most econometrics packages it is easy to compute the unit-specific time averages, $\bar{x}_i$. Assuming 
we have done that for each cross-sectional unit i, what can we expect to happen if we apply RE to 
equation (14.15)? Notice that estimation of (14.15) gives $\hat{\alpha}_{CRE}$, $\hat{\beta}_{CRE}$, and $\hat{\gamma}_{CRE}$—the CRE estimators. 
As far as $\hat{\beta}_{CRE}$ goes, the answer is a bit anticlimactic. It can be shown—see, for example, Wooldridge 
(2010, Chapter 10)—that 


$\hat{\beta}_{CRE} = \hat{\beta}_{FE}$, [14.16] 


where $\hat{\beta}_{FE}$ denotes the FE estimator from equation (14.3). In other words, adding the time average $\bar{x}_i$ 
and using random effects is the same as subtracting the time averages and using pooled OLS. 
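
This equivalence can be checked on simulated data. The sketch below is illustrative only: it uses a balanced panel with one regressor and runs pooled OLS on the equation that adds the time average, a shortcut that, for a balanced panel, reproduces the FE slope as well and avoids having to estimate $\theta$. The sample sizes and coefficients are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
N, T = 250, 5
a = rng.normal(size=(N, 1))
x = rng.normal(size=(N, T)) + 0.8 * a       # a_i correlated with x_it
y = 1.5 * x + a + rng.normal(size=(N, T))

# Time average of x for each unit, repeated once per period (unit-major order)
xbar = np.repeat(x.mean(axis=1), T)

# Regression with the time average added: y_it on 1, x_it, xbar_i
Z = np.column_stack([np.ones(N * T), x.ravel(), xbar])
alpha_hat, beta_cre, gamma_hat = np.linalg.lstsq(Z, y.ravel(), rcond=None)[0]

# Fixed effects: pooled OLS on the time-demeaned data
xd = x - x.mean(axis=1, keepdims=True)
yd = y - y.mean(axis=1, keepdims=True)
beta_fe = (xd * yd).sum() / (xd ** 2).sum()
```

The coefficient on the time-varying regressor agrees with the within estimate up to rounding, while the coefficient on the time average picks up the correlation between the regressor and the unobserved effect.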

Even though (14.15) is not needed to obtain $\hat{\beta}_{FE}$, the equivalence of the CRE and FE estimates of 
$\beta$ provides a nice interpretation of FE: it controls for the average level, $\bar{x}_i$, when measuring the partial 
effect of $x_{it}$ on $y_{it}$. As an example, suppose that $x_{it}$ is a tax rate on firm profits in county i in year t, 
and $y_{it}$ is some measure of county-level economic output. By including $\bar{x}_i$, the average tax rate in the 
county over the T years, we are allowing for systematic differences between historically high-tax and 
low-tax counties—differences that may also affect economic output. 

We can also use equation (14.15) to see why the FE estimators are often much less precise than 
the RE estimators. If we set $\gamma = 0$ in equation (14.15) then we obtain the usual RE estimator of $\beta$, 
$\hat{\beta}_{RE}$. This means that correlation between $x_{it}$ and $\bar{x}_i$ has no bearing on the variance of the RE estimator. 
By contrast, we know from multiple regression analysis in Chapter 3 that correlation between $x_{it}$ 
and $\bar{x}_i$—that is, multicollinearity—can result in a higher variance for $\hat{\beta}_{FE}$. Sometimes the variance is 
much higher, particularly when there is little variation in $x_{it}$ across t, in which case $x_{it}$ and $\bar{x}_i$ tend to be 
highly (positively) correlated. In the limiting case where there is no variation across time for any i, the 
correlation is perfect—and FE fails to provide an estimate of $\beta$. 

Apart from providing a synthesis of the FE and RE approaches, are there other reasons to consider 
the CRE approach even if it simply delivers the usual FE estimate of $\beta$? Yes, at least two. First, 
the CRE approach provides a simple, formal way of choosing between the FE and RE approaches. In 
using the CRE approach to choose between RE and FE, we must be sure to include any time-constant 
variables that appear in the RE estimation. Generally, suppose we have $x_{it1}, x_{it2}, \ldots, x_{itk}$ as the 
time-varying variables with corresponding time averages $\bar{x}_{i1}, \ldots, \bar{x}_{ik}$. Let $z_{i1}, \ldots, z_{im}$ be the time-constant 
variables, and let $d2_t, \ldots, dT_t$ be the time dummies. Then the CRE equation is 


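$y_{it} = \alpha_1 + \alpha_2 d2_t + \cdots + \alpha_T dT_t + \beta_1 x_{it1} + \cdots + \beta_k x_{itk} + \gamma_1\bar{x}_{i1} + \cdots + \gamma_k\bar{x}_{ik}$ 
$\qquad + \delta_1 z_{i1} + \cdots + \delta_m z_{im} + r_i + u_{it}, \quad t = 1, 2, \ldots, T.$ [14.17] 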


In the balanced case, note that there is no need to include the time averages of the time dummies, as 
each of these is simply 1/T for every pair (i, t). In other words, we would be adding the same constant, 
1/T, to the equation in several spots. Because we already include an intercept, the time averages of 
$d2_t, \ldots, dT_t$ are redundant. 

When equation (14.17) is estimated by RE (or even just pooled OLS), we obtain 


$\hat{\beta}_{CRE,j} = \hat{\beta}_{FE,j}, \quad j = 1, \ldots, k$ [14.18] 


$\hat{\alpha}_{CRE,t} = \hat{\alpha}_{FE,t}, \quad t = 1, \ldots, T.$ 


The null hypothesis that RE is sufficient (that is, we do not need the CRE, or equivalently FE, approach) is 


$H_0\colon \gamma_1 = \gamma_2 = \cdots = \gamma_k = 0$, [14.19] 


so there are as many restrictions as there are $\beta_j$. Using modern software, we can make this test robust 
to serial correlation in $\{u_{it}\}$ and heteroskedasticity in $r_i$ or $u_{it}$; see the discussion in the appendix. As in 
any context, especially with a large N, it is possible to obtain a strong statistical rejection of (14.19) 
but where the RE and FE estimates do not differ by much in a practical sense. 

In implementing the test to choose between RE and FE using the CRE approach, it is critical to 
include the time-constant variables $z_{i1}, \ldots, z_{im}$. If these are sufficiently good controls, we may not 
have to include the $\bar{x}_{ij}$—in other words, RE estimation may be sufficient. If we leave the $z_{ih}$ out of 
(14.17) then we are omitting important variables. The inclusion of the $z_{ih}$ has no bearing on whether 
we obtain the fixed effects estimates as reported in (14.18): we obtain the FE estimates as long as the 
full set of time averages, $\bar{x}_{i1}, \ldots, \bar{x}_{ik}$, are included. 
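
A robust version of the test that the coefficients on the time averages are zero can be sketched with a cluster-robust Wald statistic. The example below is an assumption-laden illustration, not the procedure of any particular package: it has one time-varying regressor and no time-constant controls, estimates the equation by pooled OLS rather than RE, clusters by cross-sectional unit, and applies no small-sample correction. The function name `cre_wald` and the simulated design are hypothetical.

```python
import numpy as np

def cre_wald(y, x):
    """Cluster-robust Wald statistic for gamma = 0 in y_it on 1, x_it, xbar_i.

    y, x: N x T arrays from a balanced panel with one time-varying regressor.
    """
    N, T = x.shape
    xbar = np.repeat(x.mean(axis=1), T)
    Z = np.column_stack([np.ones(N * T), x.ravel(), xbar])
    b = np.linalg.lstsq(Z, y.ravel(), rcond=None)[0]
    e = (y.ravel() - Z @ b).reshape(N, T)
    Zi = Z.reshape(N, T, 3)
    # Sandwich "meat": sum over units of (Z_i' e_i)(Z_i' e_i)', which allows
    # arbitrary serial correlation and heteroskedasticity within a unit
    s = np.einsum("ntk,nt->nk", Zi, e)
    meat = s.T @ s
    bread = np.linalg.inv(Z.T @ Z)
    V = bread @ meat @ bread
    gamma_hat = b[2]
    return gamma_hat, gamma_hat ** 2 / V[2, 2]

rng = np.random.default_rng(4)
N, T = 500, 4
c = rng.normal(size=(N, 1))
x = c + rng.normal(size=(N, T))
# Unobserved effect built with gamma = 1, so the null hypothesis is false here
a = 1.0 * x.mean(axis=1, keepdims=True) + rng.normal(size=(N, 1))
y = 2.0 * x + a + rng.normal(size=(N, T))

gamma_hat, wald = cre_wald(y, x)
```

With a single restriction the statistic is compared with a chi-square distribution with one degree of freedom, whose 5% critical value is about 3.84; in this simulated design the null is rejected by a wide margin.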

Another beneficial byproduct of the CRE approach is that we obtain coefficients on time-constant 
variables: $\hat{\delta}_1, \ldots, \hat{\delta}_m$. Some researchers complain that in using FE they cannot obtain coefficients on 
time-constant variables, but the CRE approach rectifies that. One must be careful in interpreting the 
$\hat{\delta}_h$ because the $z_{ih}$ may be correlated with the original heterogeneity, $a_i$. Nevertheless, we can examine 
the $\hat{\delta}_h$ for logical consistency, and often their signs and magnitudes make sense. In any case, as mentioned 
above, these variables must be included in testing (14.19). 

One sometimes sees a concern in using the CRE approach with how lags can be handled, or how 
standard functional forms, such as squares and interactions, can be included. The latter is particularly 
simple: the rule is to include the time averages of all time-varying explanatory variables, and that 
includes variables that are squares or interactions among other variables. For example, we can apply 
the CRE approach to the model 


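$$
\begin{aligned}
y_{it} = {} & \alpha_t + \beta_1 x_{it1} + \beta_2 x_{it2} + \beta_3 x_{it2}^2 + \beta_4 x_{it1} x_{it2} + \delta_1 z_{i1} + \delta_2 z_{i2} + \psi_1 x_{it1} z_{i1} \\
& + a_i + u_{it}, \quad t = 1, \ldots, T,
\end{aligned}
\tag{14.20}
$$

where $\alpha_t$ denotes different time intercepts. The CRE approach requires us to include the time
averages of $x_{it1}$, $x_{it2}$, $x_{it2}^2$, $x_{it1} x_{it2}$, and $x_{it1} z_{i1}$. The latter is simply $\bar{x}_{i1} z_{i1}$, while the time averages of
$\{x_{it1} x_{it2}: t = 1, \ldots, T\}$ and $\{x_{it2}^2: t = 1, \ldots, T\}$ are, respectively,


$$T^{-1} \sum_{t=1}^{T} x_{it1} x_{it2} \quad \text{and} \quad T^{-1} \sum_{t=1}^{T} x_{it2}^2.$$

These simply are included along with $\bar{x}_{i1}$, $\bar{x}_{i2}$, and $\bar{x}_{i1} z_{i1}$ in the equation estimated by random effects.
The fixed effects estimates of $\beta_1$, $\beta_2$, $\beta_3$, $\beta_4$, and $\psi_1$ will be reproduced, and we obtain estimates of $\delta_1$
and $\delta_2$ as well. In order to force the coefficients on $x_{it1}$ and $x_{it2}$ to have meaningful interpretations, we
probably would demean $x_{it1}$, $x_{it2}$, and $z_{i1}$ (in each case using the sample average across $i$ and $t$) before
obtaining the unit-specific time averages.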
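The rule of averaging the nonlinear terms themselves, rather than transforming the averages, is easy to get wrong. The sketch below uses made-up numbers for a single unit with a squared term, an interaction between two time-varying variables, and an interaction with a time-constant variable. The names `x1`, `x2`, and `z1` are hypothetical stand-ins for the variables in the text.

```python
from statistics import mean

# Hypothetical data for one unit i observed in T = 3 periods.
x1 = [1.0, 2.0, 4.0]     # a time-varying variable
x2 = [0.5, 1.5, 1.0]     # a second time-varying variable
z1 = 3.0                 # a time-constant variable

avg_x1 = mean(x1)
avg_x2 = mean(x2)
avg_x2_sq = mean(v * v for v in x2)                # time average of the square
avg_x1_x2 = mean(u * v for u, v in zip(x1, x2))    # time average of the interaction
avg_x1_z1 = mean(u * z1 for u in x1)               # interaction with a time-constant variable

print(avg_x2_sq, avg_x2 ** 2)        # average of the square vs. square of the average
print(avg_x1_x2, avg_x1 * avg_x2)    # average of the product vs. product of the averages
print(avg_x1_z1, avg_x1 * z1)        # these agree because z1 does not vary over t
```

Only the last pair agrees, which is why the time average of an interaction with a time-constant variable can be formed from the time average of the time-varying variable, while squares and interactions of time-varying variables must be averaged directly.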

Allowing lags is a bit trickier with the CRE approach and is closely related to the problem of
unbalanced panels. The key is that the time averages should be computed for all variables using only
the time periods used in the estimation. For example, if our sample starts at $t = 1$ and we want to
estimate the distributed lag equation


$$y_{it} = \alpha_t + \beta_0 w_{it} + \beta_1 w_{i,t-1} + \beta_2 w_{i,t-2} + \mathbf{x}_{it}\boldsymbol{\psi} + \mathbf{z}_i\boldsymbol{\delta} + a_i + u_{it}, \quad t = 3, 4, \ldots, T, \tag{14.21}$$


then all time averages are computed using data starting with $t = 3$; the periods $t = 1$ and $2$ are ignored.
This is true even when computing the elements of $\bar{\mathbf{x}}_i$. The reason is that we want the CRE estimates on
all time-varying explanatory variables to be the FE estimates, and this is only ensured when all time
averages are computed using the time periods available for FE estimation.

Computer Exercise 14 in this chapter illustrates how the CRE approach can be applied to the balanced
panel data set in AIRFARE and how one can test RE versus FE in the CRE framework.


14-3a Unbalanced Panels 


The correlated random effects approach also can be applied to unbalanced panels, but some care
is required. In order to obtain an estimator that reproduces the fixed effects estimates on the
time-varying explanatory variables, one must be careful in constructing the time averages. In particular, for
$y$ or any $x_j$, a time period contributes to the time average, $\bar{y}_i$ or $\bar{x}_{ij}$, only if data on all of
$(y_{it}, x_{it1}, \ldots, x_{itk})$ are observed. One way to depict the situation is to define a dummy variable, $s_{it}$,
which equals one when a complete set of data on $(y_{it}, x_{it1}, \ldots, x_{itk})$ is observed. If any element is missing
(including, of course, if the entire time period is missing), then $s_{it} = 0$. We introduced the notion of a
missing data indicator in Section 9-5 for particular variables. Here, $s_{it}$ is a complete cases indicator because it is
one if and only if we observe a complete case. This is the default for any econometrics or statistics
package: any $(i, t)$ pair where any element of $(y_{it}, x_{it1}, \ldots, x_{itk})$ is missing is dropped from the estimation.
(We discuss solutions to missing data problems for cross-sectional data when selection cannot
be ignored in Chapter 17.) With this definition, the appropriate time average of $\{y_{it}\}$ can be written as


$$\bar{y}_i = T_i^{-1} \sum_{t=1}^{T} s_{it} y_{it},$$


where $T_i$ is the total number of complete time periods for cross-sectional observation $i$. In other words,
we only average over the time periods that have a complete set of data.

Another subtle point is that when time period dummies are included in the model, or any other
variables that change only by $t$ and not $i$, we must now include their time averages (unlike in the
balanced case, where the time averages are just constants). For example, if $\{w_t: t = 1, \ldots, T\}$ is an
aggregate time variable, such as a time dummy or a linear time trend, then


$$\bar{w}_i = T_i^{-1} \sum_{t=1}^{T} s_{it} w_t.$$


Because of the unbalanced nature of the panel, $\bar{w}_i$ almost always varies somewhat across $i$ (unless the
exact same time periods are missing for all cross-sectional units). As with variables that actually change
across $i$ and $t$, the time averages of aggregate time effects are easy to obtain in many software packages.
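As a rough illustration of the two averaging rules above, the sketch below builds the complete cases indicator and the resulting time averages for a tiny, made-up unbalanced panel. The record layout and the function name `complete_case_averages` are hypothetical and not taken from any package.

```python
# Hypothetical unbalanced panel: one record per (unit, period); None marks a missing value.
records = [
    # (i, t, y, x)
    (1, 1, 2.0, 1.0), (1, 2, 3.0, None), (1, 3, 4.0, 2.0),
    (2, 1, 1.0, 0.5), (2, 2, 1.5, 0.7), (2, 3, 2.5, 0.9),
    (3, 2, 5.0, 3.0), (3, 3, None, 3.5),
]
T = 3

def complete_case_averages(records, T):
    """Return {i: (T_i, ybar_i, xbar_i, [d2bar_i, ..., dTbar_i])} using complete cases only."""
    sums = {}
    for i, t, y, x in records:
        s_it = int(y is not None and x is not None)   # complete cases indicator
        if not s_it:
            continue
        acc = sums.setdefault(i, {"T_i": 0, "y": 0.0, "x": 0.0, "d": [0] * (T - 1)})
        acc["T_i"] += 1
        acc["y"] += y
        acc["x"] += x
        if t >= 2:
            acc["d"][t - 2] += 1      # the period-t dummy equals one in period t
    return {
        i: (s["T_i"], s["y"] / s["T_i"], s["x"] / s["T_i"], [d / s["T_i"] for d in s["d"]])
        for i, s in sums.items()
    }

averages = complete_case_averages(records, T)
for i, row in sorted(averages.items()):
    print(i, row)
```

Unit 1 loses period 2 because `x` is missing there, so its averages use periods 1 and 3 only, and its averaged time dummies differ from those of the fully observed unit 2. That variation across units is why the averaged dummies are no longer redundant.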

The mechanics of the random effects estimator also change somewhat when we have an unbalanced
panel, and this is true whether we use the traditional random effects estimator or the CRE
version. Namely, the parameter $\theta$ in equation (14.10), used in equation (14.11) to obtain the
quasi-demeaned data, depends on $i$ through the number of time periods observed for unit $i$. Specifically,
simply replace $T$ in equation (14.10) with $T_i$. Econometrics packages that support random effects
estimation recognize this difference when using unbalanced panels, and so nothing special needs to be
done from a user’s perspective.
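To see how the quasi-demeaning weight changes with the number of observed periods, the short sketch below evaluates the usual random effects formula, $\theta_i = 1 - [\sigma_u^2/(\sigma_u^2 + T_i\sigma_a^2)]^{1/2}$, for a few values of $T_i$. The variance components are made-up numbers; in practice they are estimated, and the function name `theta_i` is only illustrative.

```python
from math import sqrt

def theta_i(sigma2_u, sigma2_a, T_i):
    """Quasi-demeaning weight for a unit observed in T_i complete periods."""
    return 1.0 - sqrt(sigma2_u / (sigma2_u + T_i * sigma2_a))

# Hypothetical variance components (these would be estimated in an application).
sigma2_u, sigma2_a = 1.0, 1.0
for T_i in (1, 3, 8):
    print(T_i, round(theta_i(sigma2_u, sigma2_a, T_i), 4))
```

Units observed in more periods get a weight closer to one, so their data are transformed in a way that is closer to full time-demeaning.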

The bottom line is that, once the time averages have been properly obtained, using an equation 
such as (14.17) is the same as in the balanced case. We can still use a test of statistical significance 
on the set of time averages to choose between fixed effects and pure random effects, and the CRE 
approach still allows us to include time-constant variables. 

As with fixed effects estimation, a key issue is understanding why the panel data set is unbalanced.
In the pure random effects case, the selection indicator, $s_{it}$, cannot be correlated with the composite
error in equation (14.7), $a_i + u_{it}$, in any time period. Otherwise, as discussed in Wooldridge
(2010, Chapter 19), the RE estimator is inconsistent. As discussed in Section 14-1, the FE estimator
allows for arbitrary correlation between the selection indicator, $s_{it}$, and the fixed effect, $a_i$. Therefore,
the FE estimator is more robust in the context of unbalanced panels. And, as we already know, FE
allows arbitrary correlation between time-varying explanatory variables and $a_i$.


14-4 General Policy Analysis with Panel Data 


In Section 13-4 we discussed how two periods of panel data can be used to evaluate policy interventions
in a before-after design. In particular, we observe each unit $i$ (such as an individual, firm,
school district, or county) in two periods. In the initial period, none of the units is subject to the
intervention. Then, an intervention applies to a subset of the units, and we then obtain a second year
of data for all units post intervention. In the case where we do not control for additional explanatory
variables, the estimated policy effect is the panel data version of the difference-in-differences estimator
in equation (13.33).

We now know that, with $T = 2$ time periods, there is no difference between the FE and FD estimators.
Therefore, applying fixed effects to the program evaluation equation


$$y_{it} = \eta_1 + \alpha_2 d2_t + \beta w_{it} + a_i + u_{it}, \quad t = 1, 2,$$


where $w_{i1} = 0$ for all $i$, produces the DD estimator, $\hat{\beta}_{DD}$.
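A small numerical check of this equivalence is sketched below with made-up two-period data. With no unit treated in the first period, the first-difference regression of the change in the outcome on the change in the treatment indicator has a slope equal to the difference in mean changes between treated and control units. The data values are hypothetical.

```python
from statistics import mean

# Hypothetical two-period panel: (y_i1, y_i2, w_i2); w_i1 = 0 for every unit.
units = [
    (10.0, 12.0, 0), (8.0, 9.5, 0), (11.0, 12.5, 0),
    (9.0, 13.0, 1), (7.0, 11.5, 1), (12.0, 16.5, 1),
]

dy = [y2 - y1 for y1, y2, _ in units]   # first differences of the outcome
dw = [w2 for _, _, w2 in units]         # change in treatment status, w_i2 - 0

# Difference-in-differences: mean change for treated minus mean change for controls.
dd = mean(d for d, w in zip(dy, dw) if w == 1) - mean(d for d, w in zip(dy, dw) if w == 0)

# OLS slope from regressing the change in y on an intercept and the change in w.
dw_bar, dy_bar = mean(dw), mean(dy)
fd = (sum((w - dw_bar) * (d - dy_bar) for w, d in zip(dw, dy))
      / sum((w - dw_bar) ** 2 for w in dw))

print(dd, fd)
```

The intercept in the differenced regression plays the role of the coefficient on the second-period dummy.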

The two-period, before-after setting can be powerful, but it is a special case of a more general
policy analysis framework. A general framework is easy to implement, and the results can be convincing
provided one accounts for aggregate time effects through a full set of time period dummies and
eliminates unobserved heterogeneity via first differencing or fixed effects (which can be implemented
as a CRE approach). A general equation is


$$y_{it} = \eta_1 + \alpha_2 d2_t + \cdots + \alpha_T dT_t + \beta w_{it} + \mathbf{x}_{it}\boldsymbol{\psi} + a_i + u_{it}, \quad t = 1, \ldots, T, \tag{14.22}$$


where $w_{it}$ is the binary intervention indicator and our interest is in $\beta$. This equation allows for flexible
aggregate time effects and other controls, $\mathbf{x}_{it}$. The program indicator, $w_{it}$, can have any pattern.
Typically, some units never participate, so $w_{it} = 0$, $t = 1, \ldots, T$. It is often the case that $w_{it} = 0$ for the
early time periods but then turns on in later time periods, but we might have staggered interventions,
in which, say, some counties are subject to a policy change in 2010 and others not until 2012. It is
important to be sure that $w_{it}$ is properly defined for each $i$ and $t$, but once that is done, estimation and
inference are straightforward.

To allow $w_{it}$ to be systematically related to $a_i$, (14.22) should be estimated by FE or FD, with
cluster-robust standard errors. One does nothing special because $w_{it}$ is binary. In particular, if one uses
FD, the estimating equation is


$$\Delta y_{it} = \alpha_2 \Delta d2_t + \cdots + \alpha_T \Delta dT_t + \beta \Delta w_{it} + \Delta\mathbf{x}_{it}\boldsymbol{\psi} + \Delta u_{it}, \quad t = 2, \ldots, T. \tag{14.23}$$


We can also use FE estimation and compare the results.

Allowing dynamic effects is also straightforward. For example, to allow up to a two-period
lagged effect, we can estimate


$$
\begin{aligned}
y_{it} = {} & \eta_1 + \alpha_2 d2_t + \cdots + \alpha_T dT_t + \beta_0 w_{it} + \beta_1 w_{i,t-1} + \beta_2 w_{i,t-2} + \mathbf{x}_{it}\boldsymbol{\psi} \\
& + a_i + u_{it}, \quad t = 3, \ldots, T,
\end{aligned}
\tag{14.24}
$$


by FD or FE. Modern econometrics packages make it easy to include lags and, if FD is preferred to
FE, to difference along with including lags.

In estimating an equation such as (14.22) or (14.24), it is important not to try to shoehorn the
analysis into a traditional difference-in-differences framework. The DID setup is a special case of
(14.22), where $T = 2$ and $w_{i1} = 0$ for all $i$. Further, in the basic DID setup, the program indicator can
be written as $w_{it} = prog_i \cdot d2_t$, $t = 1, 2$, where $prog_i$ indicates whether unit $i$ was eventually subject
to the intervention and $d2_t$ is the dummy variable indicating the second time period. In cases with
staggered interventions and multiple time periods, there is no way to write $w_{it}$ in this simple way. And
there is no need to. The key is to define $w_{it}$ so that $w_{it} = 1$ when unit $i$ is “treated” at time $t$, and zero
otherwise. This is true whether the analysis is static or dynamic.
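Defining the intervention indicator under staggered adoption is mostly bookkeeping. The sketch below is one way to do it, assuming (as a simplification not stated in the text) that a policy stays in place once adopted; the county labels, adoption years, and function name are hypothetical.

```python
# Hypothetical adoption years by county; None means the county is never treated.
adoption_year = {"A": 2010, "B": 2012, "C": None}
years = range(2008, 2014)

def treatment_indicator(adoption_year, years):
    """Return {(unit, year): w_it}, with w_it = 1 from the adoption year onward."""
    return {
        (unit, year): int(start is not None and year >= start)
        for unit, start in adoption_year.items()
        for year in years
    }

w = treatment_indicator(adoption_year, years)
for unit in adoption_year:
    print(unit, [w[(unit, year)] for year in years])
```

If policies can be switched off again, the indicator would instead be built directly from the periods in which each unit is treated.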


14-4a Advanced Considerations with Policy Analysis 


Unless $w_{it}$ is randomized, even FE and FD estimation of (14.22), or dynamic extensions, can produce
unreliable results if $w_{it}$ reacts to past shocks. More precisely, consider a shock to $y_{it}$ at time $t$: say, $y_{it}$
is a poverty rate in county $i$ at time $t$, and $w_{it}$ is some measure of government assistance. (It need not
be binary.) If $u_{it}$ is unusually high, this might affect $w_{i,t+1}$; that is, spending on, say, unemployment
benefits may increase in the following year. Such feedback is a violation of the strict exogeneity
assumption, and spells trouble for both FE and FD. It turns out that, with at least three time periods,
one can easily test for feedback: simply add next period’s value, $w_{i,t+1}$, to the equation and estimate the
resulting equation by FE (or FD). In particular, estimate


$$
\begin{aligned}
y_{it} = {} & \eta_1 + \alpha_2 d2_t + \cdots + \alpha_{T-1} d(T-1)_t + \beta w_{it} + \delta w_{i,t+1} + \mathbf{x}_{it}\boldsymbol{\psi} \\
& + a_i + u_{it}, \quad t = 1, \ldots, T - 1,
\end{aligned}
\tag{14.25}
$$


by FE and compute a cluster-robust $t$ statistic for $\hat{\delta}$. The idea is that, controlling for the intervention
this period (and maybe lags, if appropriate), next period’s policy assignment should not help to predict
the outcome this period. We use fixed effects because $w_{it}$ can be correlated with $a_i$ in every time
period. In implementing this test, we should not ascribe any meaning to the coefficients, except that
the sign of $\hat{\delta}$ can tell us something about the nature of any feedback. We estimate (14.25), losing the
last time period, to determine whether there is likely bias in FE or FD. (As discussed in Section 14-1b,
FE has some resiliency to feedback that FD does not, but with small $T$ the FE estimator can still have
substantial bias.)
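The mechanics of the feedback test can be sketched with simulated data. The example below is only an illustration under strong assumptions: feedback is built in by letting next period's treatment respond to this period's shock, time dummies and other controls are omitted, and the cluster-robust variance uses no small-sample correction. The helper names `within` and `fe_cluster` are hypothetical.

```python
import numpy as np

def within(v, ids):
    """Subtract unit-specific means (the fixed effects transformation)."""
    v = np.asarray(v, dtype=float)
    out = v.copy()
    for g in np.unique(ids):
        m = ids == g
        out[m] = v[m] - v[m].mean(axis=0)
    return out

def fe_cluster(y, X, ids):
    """FE coefficients and cluster-robust standard errors, clustering by unit."""
    yd, Xd = within(y, ids), within(X, ids)
    XtX_inv = np.linalg.inv(Xd.T @ Xd)
    b = XtX_inv @ Xd.T @ yd
    e = yd - Xd @ b
    meat = np.zeros((Xd.shape[1], Xd.shape[1]))
    for g in np.unique(ids):
        m = ids == g
        s = Xd[m].T @ e[m]            # within-unit sum of regressor times residual
        meat += np.outer(s, s)
    V = XtX_inv @ meat @ XtX_inv
    return b, np.sqrt(np.diag(V))

# Simulated data with feedback: w_{i,t+1} switches on after a positive shock u_it.
rng = np.random.default_rng(1)
N, T = 300, 6
a = rng.normal(size=(N, 1))
u = rng.normal(size=(N, T))
w = np.zeros((N, T))
w[:, 0] = rng.integers(0, 2, size=N)
w[:, 1:] = (u[:, :-1] > 0).astype(float)
y = a + 1.0 * w + u

# Use periods 1, ..., T-1 and add the lead of w as an extra regressor.
ids = np.repeat(np.arange(N), T - 1)
y_s = y[:, :-1].ravel()
X_s = np.column_stack([w[:, :-1].ravel(), w[:, 1:].ravel()])
b, se = fe_cluster(y_s, X_s, ids)
t_lead = b[1] / se[1]
print(f"delta_hat = {b[1]:.3f}, cluster-robust t = {t_lead:.2f}")
```

In this design the coefficient on the lead is large and strongly significant, which is the signal of feedback the test is meant to detect.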

Some literatures refer to a test like that obtained in (14.25) as a falsification test. For example,
in the economics of education, such tests are used to determine whether students are assigned to
classrooms using a nonrandom assignment mechanism. For concreteness, let $y_{it}$ be the outcome on
a standardized test for student $i$ in grade $t$, and let $w_{it}$ denote class size. The test of $H_0: \delta = 0$ simply
means that future class size cannot have a causal effect on this year’s exam performance, once
we control for current class size, other covariates, and unobserved effects. [If lagged class size
matters, then we would apply the test to a model such as (14.24).] In estimating so-called teacher
value added, the variable $w_{it}$ is replaced with a vector of binary variables indicating teacher assignment.
Then, we are testing whether, say, the fifth grade teacher helps to predict a fourth grade test
outcome.

Wooldridge (2010, Chapter 11) discusses extensions of the basic model that can be applied to
policy analysis (as well as panel data generally). One useful model is a heterogeneous trend model
(sometimes called a random trend model):


$$y_{it} = \eta_1 + \alpha_2 d2_t + \cdots + \alpha_T dT_t + \beta w_{it} + \mathbf{x}_{it}\boldsymbol{\psi} + a_i + g_i t + u_{it}, \quad t = 1, \ldots, T, \tag{14.26}$$


where the new term is $g_i t$, a unit-specific time trend. The key is that, like $a_i$, $g_i$ is unobserved. If the policy
intervention is correlated not just with level differences among units (as captured by the $a_i$) but also with
trend differences, now captured by $g_i t$, then the usual FE or FD analysis can be biased and inconsistent
for $\beta$. One still needs to account for aggregate fluctuations in a flexible way by including aggregate
time dummies (although one will drop out because the unit-specific linear trend replaces one
aggregate time effect).

One approach to estimating (14.26) is straightforward. First difference the equation to remove $a_i$:


$$\Delta y_{it} = \alpha_2 \Delta d2_t + \cdots + \alpha_T \Delta dT_t + \beta \Delta w_{it} + \Delta\mathbf{x}_{it}\boldsymbol{\psi} + g_i + \Delta u_{it}, \quad t = 2, \ldots, T, \tag{14.27}$$


where we use the simple fact $g_i t - g_i(t - 1) = g_i$. Now, equation (14.27) has the same structure as
the basic unobserved effects model, with unobserved effect $g_i$, except that every variable has been
differenced. As discussed in Wooldridge (2010, Section 11.7), an attractive strategy is to apply FE
estimation to (14.27), and obtain cluster-robust standard errors. Note that doing so requires us to
start with $T \geq 3$ time periods. This is the cost in allowing for a second source of heterogeneity that is
allowed to be correlated with the explanatory variables.
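The two-step logic (difference first, then apply the within transformation to the differenced data) can be illustrated with a noiseless toy panel. Everything in the sketch below is hypothetical: the units, trends, and adoption periods are made up so that treated units have steeper trends, and idiosyncratic errors and time dummies are omitted so that the arithmetic is exact.

```python
from statistics import mean

# Hypothetical noiseless panel: y_it = a_i + g_i * t + beta * w_it, with beta = 2.
# Units with steeper trends (larger g_i) adopt the policy, so w_it is correlated with g_i.
beta, T = 2.0, 5
units = [  # (a_i, g_i, first treated period or None)
    (1.0, 0.2, None), (0.5, 0.3, None), (2.0, 1.0, 3), (1.5, 1.2, 4), (0.0, 0.9, 2),
]

def panel(a, g, start):
    w = [int(start is not None and t >= start) for t in range(1, T + 1)]
    y = [a + g * t + beta * w_t for t, w_t in zip(range(1, T + 1), w)]
    return y, w

def diff(v):
    return [v[t] - v[t - 1] for t in range(1, len(v))]

def demean(v):
    m = mean(v)
    return [x - m for x in v]

def slope(xs, ys):
    """OLS slope through the origin (inputs are already demeaned)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

dy_all, dw_all, dy_fe, dw_fe = [], [], [], []
for a, g, start in units:
    y, w = panel(a, g, start)
    dy, dw = diff(y), diff(w)
    dy_all += dy                 # plain first differences, pooled across units
    dw_all += dw
    dy_fe += demean(dy)          # within transformation applied to the differences
    dw_fe += demean(dw)

fd_only = slope(demean(dw_all), demean(dy_all))   # FD with an intercept; g_i left in the error
fd_then_fe = slope(dw_fe, dy_fe)                  # FE applied to the differenced equation
print(fd_only, fd_then_fe)
```

In this toy design plain first differencing overstates the effect because the remaining $g_i$ is correlated with the change in treatment, while differencing followed by the within transformation recovers the coefficient.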


14-5 Applying Panel Data Methods to Other Data Structures


The various panel data methods can be applied to certain data structures that do not involve time. For 
example, it is common in demography to use siblings (sometimes twins) to account for unobserved 
family and background characteristics. Usually we want to allow the unobserved “family effect,” 
which is common to all siblings within a family, to be correlated with observed explanatory variables. 
If those explanatory variables vary across siblings within a family, differencing across sibling pairs— 
or, more generally, using the within transformation within a family—is preferred as an estimation 
method. By removing the unobserved effect, we eliminate potential bias caused by confounding family
background characteristics. Implementing fixed effects on such data structures is rather straightforward
in regression packages that support FE estimation.

As an example, Geronimus and Korenman (1992) used pairs of sisters to study the effects of 
teen childbearing on future economic outcomes. When the outcome is income relative to needs— 
something that depends on the number of children—the model is 


$$
\begin{aligned}
\log(\mathit{incneeds}_{fs}) = {} & \beta_0 + \delta_0 \mathit{sister2}_s + \beta_1 \mathit{teenbrth}_{fs} \\
& + \beta_2 \mathit{age}_{fs} + \text{other factors} + a_f + u_{fs},
\end{aligned}
\tag{14.28}
$$


where $f$ indexes family and $s$ indexes a sister within the family. The intercept for the first sister is $\beta_0$,
and the intercept for the second sister is $\beta_0 + \delta_0$. The variable of interest is $\mathit{teenbrth}_{fs}$, which is a
binary variable equal to one if sister $s$ in family $f$ had a child while a teenager. The variable $\mathit{age}_{fs}$ is
the current age of sister $s$ in family $f$; Geronimus and Korenman also used some other controls. The
unobserved variable $a_f$, which changes only across family, is an unobserved family effect or a family
fixed effect. The main concern in the analysis is that $\mathit{teenbrth}$ is correlated with the family effect. If so,
an OLS analysis that pools across families and sisters gives a biased estimator of the effect of teenage
motherhood on economic outcomes. Solving this problem is simple: within each family, difference
(14.28) across sisters to get


$$\Delta\log(\mathit{incneeds}) = \delta_0 + \beta_1 \Delta \mathit{teenbrth} + \beta_2 \Delta \mathit{age} + \cdots + \Delta u; \tag{14.29}$$


this removes the family effect, $a_f$, and the resulting equation can be estimated by OLS. Notice that there
is no time element here: the differencing is across sisters within a family. Also, we have allowed for
differences in intercepts across sisters in (14.28), which leads to a nonzero intercept in the differenced
equation, (14.29). If in entering the data the order of the sisters within each family is essentially random,
the estimated intercept should be close to zero. But even in such cases it does not hurt to include an
intercept in (14.29), and having the intercept allows for the fact that, say, the first sister listed might
always be the neediest.

GOING FURTHER 14.4
When using the differencing method, does it make sense to include dummy variables
for the mother and father’s race in (14.28)? Explain.

Using 129 sister pairs from the 1982 National Longitudinal Survey of Young Women, Geronimus
and Korenman first estimated $\beta_1$ by pooled OLS to obtain $-.33$ or $-.26$, where the second estimate
comes from controlling for family background variables (such as parents’ education); both estimates
are very statistically significant [see Table 3 in Geronimus and Korenman (1992)]. Therefore, teenage
motherhood has a rather large impact on future family income. However, when the differenced equation
is estimated, the coefficient on $\mathit{teenbrth}$ is $-.08$, which is small and statistically insignificant.
This suggests that it is largely a woman’s family background that affects her future income, rather
than teenage childbearing.

Geronimus and Korenman looked at several other outcomes and two other data sets; in some
cases, the within family estimates were economically large and statistically significant. They also
showed how the effects disappear entirely when the sisters’ education levels are controlled for.

Ashenfelter and Krueger (1994) used the differencing methodology to estimate the return to
education. They obtained a sample of 149 identical twins and collected information on earnings,
education, and other variables. Identical twins were used because they should have the same underlying
ability. This can be differenced away by using twin differences, rather than OLS on the pooled
data. Because identical twins are the same in age, gender, and race, these factors all drop out of the
differenced equation. Therefore, Ashenfelter and Krueger regressed the difference in log(earnings)
on the difference in education and estimated the return to education to be about 9.2% ($t = 3.83$).
Interestingly, this is actually larger than the pooled OLS estimate of 8.4% (which controls for gender,
age, and race). Ashenfelter and Krueger also estimated the equation by random effects and obtained
8.7% as the return to education. (See Table 5 in their paper.) The random effects analysis is mechanically
the same as the panel data case with two time periods.
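The mechanics of differencing within matched pairs can be sketched with a made-up, noiseless example in which a family effect is positively related to schooling. The numbers below are purely hypothetical and are not meant to mimic the Ashenfelter and Krueger data or the direction of their findings; they only show that differencing across twins removes whatever is common to the pair.

```python
from statistics import mean

def ols_slope(x, y):
    """Slope from a simple regression of y on x with an intercept."""
    xb, yb = mean(x), mean(y)
    return sum((a - xb) * (b - yb) for a, b in zip(x, y)) / sum((a - xb) ** 2 for a in x)

# Hypothetical noiseless twin pairs: (family effect, educ of twin 1, educ of twin 2).
# Families with a larger unobserved effect also have more schooling on average.
ret = 0.09
pairs = [(0.0, 12, 12), (0.1, 12, 14), (0.3, 16, 14), (0.5, 16, 18), (0.6, 18, 17)]

educ, logearn, d_educ, d_logearn = [], [], [], []
for fam, e1, e2 in pairs:
    y1, y2 = fam + ret * e1, fam + ret * e2
    educ += [e1, e2]
    logearn += [y1, y2]
    d_educ.append(e2 - e1)        # differencing across twins removes the family effect
    d_logearn.append(y2 - y1)

pooled = ols_slope(educ, logearn)           # pooled OLS picks up the family effect
differenced = ols_slope(d_educ, d_logearn)  # differenced equation recovers ret
print(pooled, differenced)
```

Here the pooled slope is inflated because the family effect moves with education, while the differenced regression returns the coefficient used to generate the data.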

The samples used by Geronimus and Korenman (1992) and Ashenfelter and Krueger (1994) 
are examples of matched pairs samples. More generally, fixed and random effects methods can be 
applied to a cluster sample. A cluster sample has the same appearance as a cross-sectional data set, 
but there is an important difference: clusters of units are sampled from a population of clusters rather 
than sampling individuals from the population of individuals. In the previous examples, each family 
is sampled from the population of families, and then we obtain data on at least two family members. 
Therefore, each family is a cluster. 

As another example, suppose we are interested in modeling individual pension plan participation 
decisions. One might obtain a random sample of working individuals—say, from the United States— 
but it is also common to sample firms from a population of firms. Once the firms are sampled, one 
might collect information on all workers or a subset of workers within each firm. In either case, the 
resulting data set is a cluster sample because sampling was first at the firm level. Unobserved firm- 
level characteristics (along with observed firm characteristics) are likely to be present in participation 
decisions, and this within-firm correlation must be accounted for. Fixed effects estimation is preferred 
when we think the unobserved cluster effect (an example of which is $a_i$ in (14.12)) is correlated
with one or more of the explanatory variables. Then, we can only include explanatory variables that 
vary, at least somewhat, within clusters. The cluster sizes are rarely the same, so we are effectively 
using fixed effects methods for unbalanced panels. 

Educational data on student outcomes can also come in the form of a cluster sample, where a 
sample of schools is obtained from the population of schools, and then information on students within 
each school is obtained. Each school acts as a cluster, and allowing a school effect to be correlated 
with key explanatory variables—say, whether a student participates in a state-sponsored tutoring 
program—is likely to be important. Because the rate at which students are tutored likely varies by 
school, it is probably a good idea to use fixed effects estimation. One often sees authors use, as a 
shorthand, “I included school fixed effects in the analysis.” 

The correlated random effects approach can be applied immediately to cluster samples because, 
for the purposes of estimation, a cluster sample acts like an unbalanced panel. Now, the averages 
that are added to the equation are within-cluster averages—for example, averages of students within 
schools. The only difference with panel data is that the notion of serial correlation in idiosyncratic 
errors is not relevant. Nevertheless, as discussed in Wooldridge (2010, Chapter 20), there are still 
good reasons for using cluster-robust standard errors, whether one uses fixed effects or correlated 
random effects. In the case of CRE, the need to cluster is obvious, as part of the cluster effect is still in 
the error term. For FE, it is less obvious, but neglected slope heterogeneity generally induces within 
cluster correlation. 

In some cases, the key explanatory variables—often policy variables—change only at the level 
of the cluster, not within the cluster. In such cases the fixed effects approach is not applicable. For 
example, we may be interested in the effects of measured teacher quality on student performance, 
where each cluster is an elementary school classroom. Because all students within a cluster have 
the same teacher, eliminating a “class effect” also eliminates any observed measures of teacher 
quality. If we have good controls in the equation, we may be justified in applying random effects
on the unbalanced cluster. As with panel data, the key requirement for RE to produce convincing
estimates is that the explanatory variables are uncorrelated with the unobserved cluster effect.
Most econometrics packages allow random effects estimation on unbalanced clusters without
much effort.

Pooled OLS is also commonly applied to cluster samples when eliminating a cluster effect via 
fixed effects is infeasible or undesirable. However, as with panel data, the usual OLS standard errors 
are incorrect unless there is no cluster effect, and so robust standard errors that allow “cluster
correlation” (and heteroskedasticity) should be used. Some regression packages have simple commands
to correct standard errors and the usual test statistics for general within cluster correlation (as well 
as heteroskedasticity). These are the same corrections that work for pooled OLS on panel data sets, 
which we reported in Example 13.9. As an example, Papke (1999) estimates linear probability models 
for the continuation of defined benefit pension plans based on whether firms adopted defined
contribution plans. Because there is likely to be a firm effect that induces correlation across different plans
within the same firm, Papke corrects the usual OLS standard errors for cluster sampling, as well as for 
heteroskedasticity in the linear probability model. 
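For readers who want to see what a cluster correction computes, the sketch below implements the basic sandwich formula, $(X'X)^{-1}\big(\sum_g X_g'\hat{u}_g\hat{u}_g'X_g\big)(X'X)^{-1}$, for pooled OLS on simulated firm-level clusters. It is a bare-bones illustration under stated assumptions: the data are simulated, the function name `ols_cluster_se` is hypothetical, and no finite-sample adjustment is applied, whereas packaged routines typically include one.

```python
import numpy as np

def ols_cluster_se(y, X, cluster):
    """Pooled OLS with the cluster-robust sandwich variance (no finite-sample adjustment)."""
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    e = y - X @ b
    meat = np.zeros((X.shape[1], X.shape[1]))
    for g in np.unique(cluster):
        m = cluster == g
        score = X[m].T @ e[m]          # sum of regressors times residuals within cluster g
        meat += np.outer(score, score)
    return b, np.sqrt(np.diag(XtX_inv @ meat @ XtX_inv))

# Simulated cluster sample (illustrative): 60 firms, 10 workers each, with a firm effect.
rng = np.random.default_rng(2)
G, n = 60, 10
firm = np.repeat(np.arange(G), n)
x = rng.normal(size=G)[firm] + rng.normal(size=G * n)   # regressor with a firm-level part
y = 1.0 + 0.5 * x + rng.normal(size=G)[firm] + rng.normal(size=G * n)
X = np.column_stack([np.ones(G * n), x])

b, se_cluster = ols_cluster_se(y, X, firm)
_, se_hc0 = ols_cluster_se(y, X, np.arange(G * n))      # each observation its own cluster
print(b, se_cluster, se_hc0)
```

Treating every observation as its own cluster reduces the formula to the usual heteroskedasticity-robust variance, which makes the comparison between the two sets of standard errors easy to read.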

Before ending this section some final comments are in order. Given the readily available tools of 
fixed effects, random effects, and cluster-robust standard inference, it is tempting to find reasons to 
use clustering methods where none may exist. For example, if a set of data is obtained from a random 
sample from the population, then there is usually no reason to account for cluster effects in computing
standard errors after OLS estimation. The fact that the units can be put into groups ex post (that
is, after the random sample has been obtained) is not a reason to make inference robust to cluster
correlation. 

To illustrate this point, suppose that, out of the population of fourth-grade students in the United
States, a random sample of 50,000 is obtained. These data are properly studied using standard methods
for cross-sectional regression. It may be tempting to group the students by, say, the 50 states plus the 
District of Columbia—assuming a state identifier is included—and then treat the data as a cluster 
sample. But this would be wrong, and clustering the standard errors at the state level can produce 
standard errors that are systematically too large. Or, they might be too small because the asymptotic 
theory underlying cluster sampling assumes that we have many clusters with each cluster size being 
relatively small. In any case, a simple thought experiment shows that clustering cannot be correct. 
For example, if we know the county of residence for each student, why not cluster at the county 
level? Or, at a coarser level, we can divide the United States into nine census regions and treat those 
as the clusters, and this would give a different set of standard errors (that do not have any theoretical
justification). Taking this argument to its extreme, one could argue that we have one cluster: the
entire United States, in which case the clustered standard errors would not be defined and inference 
would be impossible. The confusion comes about because the clusters are defined ex post—that is, 
after the random sample is obtained. In a true cluster sample, the clusters are first drawn from a population
of clusters, and then individuals are drawn from the clusters. See Abadie, Athey, Imbens, and
Wooldridge (2018) for further discussion. 

One case where clustering is attractive, even after random sampling, is when the key policy variable
is applied at a group level, and then these groups become the clusters. For example, we may be
evaluating a county-level intervention—such as a minimum wage law—where we obtain labor market 
information for a random sample of individuals. Whether we cluster or not depends on the thought 
experiment concerning the causal effect we hope to estimate. If we choose not to cluster, then the 
standard error is valid only for the treatment effect that conditions on the particular configuration 
that we actually observe: some counties have a minimum wage equal to the federal average, some 
are higher. But the configuration could be different, especially in the future. By not clustering we 
only account for sampling variation due to the random sampling. If we cluster, we also account for 
uncertainty due to the policy assignment. Not surprisingly, accounting for assignment uncertainty
(in addition to sampling uncertainty) can greatly increase the standard errors, but ideally we can
obtain a confidence interval for the causal effect that accounts for different potential assignments. 
Unfortunately, for some problems, clustering is not possible. This is the case when we do not have 
many groups—such as the standard difference-in-differences setup where we have four groups (the 
control and treated each in the before and after periods). Then, our only recourse is to compute the 
usual heteroskedasticity-robust standard errors. Wooldridge (2003) contains further discussion. 

A similar situation occurs when a cluster-level variable is created after a random sample is 
obtained. For example, suppose we have a random sample of students and we use the students within 
a school to compute a school-level poverty variable. The goal might be to determine whether there 
are peer effects due to poverty status. Using such a variable can effectively create cluster correlation 
within school in a student-level equation. Thus, one should generally account for cluster correlation 
when using peer effects-type variables. However, it should be remembered that clustering standard 
errors is justified only if we have a “reasonably large” number of clusters and the cluster sizes are not 
too large. Unfortunately, it is difficult to pin down precise sample sizes that support clustering, but in 
some cases it can work well with as few as 30 or even 20 clusters (if the clusters are relatively small). 
See, for example, Hansen (2007). 


Summary 


In this chapter we have continued our discussion of panel data methods, studying the fixed effects and 
random effects estimators, and also described the correlated random effects approach as a unifying 
framework. Compared with first differencing, the fixed effects estimator is efficient when the idiosyn- 
cratic errors are serially uncorrelated (as well as homoskedastic), and we make no assumptions about 
correlation between the unobserved effect a; and the explanatory variables. As with first differencing, any 
time-constant explanatory variables drop out of the analysis. Fixed effects methods apply immediately to 
unbalanced panels, but we must assume that the reasons some time periods are missing are not systemati- 
cally related to the idiosyncratic errors. 

The random effects estimator is appropriate when the unobserved effect is thought to be uncorre- 
lated with all the explanatory variables. Then, a; can be left in the error term, and the resulting serial 
correlation over time can be handled by generalized least squares estimation. Conveniently, feasible GLS 
can be obtained by a pooled regression on quasi-demeaned data. The value of the estimated transforma- 
tion parameter, 6, indicates whether the estimates are likely to be closer to the pooled OLS or the fixed 
effects estimates. If the full set of random effects assumptions holds, the random effects estimator is 
asymptotically—as N gets large with T fixed—more efficient than pooled OLS, first differencing, or fixed 
effects (which are all unbiased, consistent, and asymptotically normal). 

The correlated random effects approach to panel data models has become more popular in recent 
years, primarily because it allows a simple test for choosing between FE and RE, and it allows one to incor- 
porate time-constant variables in an equation that delivers the FE estimates of the time-varying variables. 

We discussed a general framework for policy analysis, in which any number of time periods and any 
pattern of “treatment” assignment is allowed. The basic difference-in-differences is a special case. But by 
carefully defining the intervention indicator, it is easy to allow dynamic effects and to estimate the equation 
by FE or FD. 

Finally, the panel data methods studied in Chapters 13 and 14 can be used when working with 
matched pairs or cluster samples. Differencing or the within transformation eliminates the cluster effect. If 
the cluster effect is uncorrelated with the explanatory variables, pooled OLS can be used, but the standard 
errors and test statistics should be adjusted for cluster correlation. Random effects estimation is also a possibility.
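
To make the cluster adjustment concrete, the following Python sketch computes the usual and the cluster-robust standard error of an OLS slope in a simple regression. It is a sketch under stated assumptions: the data are simulated with a common cluster component in both the regressor and the error, the function name `slope_with_ses` is hypothetical, and the formula omits the finite-sample adjustments that statistical packages typically apply.

```python
import math
import random

def slope_with_ses(x, y, cluster):
    """OLS slope with the usual and the cluster-robust standard error (no small-sample correction)."""
    n = len(x)
    xm, ym = sum(x) / n, sum(y) / n
    sst = sum((xi - xm) ** 2 for xi in x)
    b1 = sum((xi - xm) * (yi - ym) for xi, yi in zip(x, y)) / sst
    b0 = ym - b1 * xm
    resid = [yi - b0 - b1 * xi for xi, yi in zip(x, y)]
    # Usual OLS standard error under homoskedastic, uncorrelated errors.
    se_usual = math.sqrt(sum(e * e for e in resid) / (n - 2) / sst)
    # Cluster-robust: sum (x - xbar) * residual within each cluster, then square the sums.
    score = {}
    for xi, e, g in zip(x, resid, cluster):
        score[g] = score.get(g, 0.0) + (xi - xm) * e
    se_cluster = math.sqrt(sum(s * s for s in score.values())) / sst
    return b1, se_usual, se_cluster

# Simulated cluster sample: 200 clusters of size 5.
random.seed(0)
x, y, cluster = [], [], []
for g in range(200):
    x_g, c_g = random.gauss(0, 1), random.gauss(0, 1)  # cluster components
    for _ in range(5):
        xi = x_g + random.gauss(0, 1)
        x.append(xi)
        y.append(1.0 + 0.5 * xi + c_g + random.gauss(0, 1))
        cluster.append(g)

b1, se_usual, se_cluster = slope_with_ses(x, y, cluster)
print(b1, se_usual, se_cluster)
```

With a cluster effect in the error and a regressor that also varies mostly across clusters, the cluster-robust standard error comes out larger than the usual one, which is why ignoring the clustering tends to overstate precision.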
Key Terms

Cluster Effect
Cluster Sample
Clustering
Complete Cases Indicator
Composite Error Term
Correlated Random Effects
Dummy Variable Regression
Falsification Test
Fixed Effects Estimator
Fixed Effects Transformation
Heterogeneous Trend Model
Matched Pairs Samples
Quasi-Demeaned Data
Random Effects Estimator
Random Effects Model
Time-Demeaned Data
Unbalanced Panel
Unobserved Effects Model
Within Estimator
Within Transformation


Problems 


1 Suppose that the idiosyncratic errors in (14.4), $\{u_{it}: t = 1, 2, \ldots, T\}$, are serially uncorrelated with
constant variance, $\sigma_u^2$. Show that the correlation between adjacent differences, $\Delta u_{it}$ and $\Delta u_{i,t+1}$, is $-.5$.
Therefore, under the ideal FE assumptions, first differencing induces negative serial correlation of a
known value.


2 With a single explanatory variable, the equation used to obtain the between estimator is

$$\bar{y}_i = \beta_0 + \beta_1 \bar{x}_i + a_i + \bar{u}_i,$$

where the overbar represents the average over time. We can assume that $E(a_i) = 0$ because we have
included an intercept in the equation. Suppose that $\bar{u}_i$ is uncorrelated with $\bar{x}_i$, but $\mathrm{Cov}(x_{it}, a_i) = \sigma_{xa}$ for
all t (and i because of random sampling in the cross section).

(i) Letting $\tilde{\beta}_1$ be the between estimator, that is, the OLS estimator using the time averages, show
that

$$\mathrm{plim}\ \tilde{\beta}_1 = \beta_1 + \sigma_{xa}/\mathrm{Var}(\bar{x}_i),$$

where the probability limit is defined as $N \to \infty$. [Hint: See equations (5.5) and (5.6).]

(ii) Assume further that the $x_{it}$, for all $t = 1, 2, \ldots, T$, are uncorrelated with constant variance $\sigma_x^2$.
Show that $\mathrm{plim}\ \tilde{\beta}_1 = \beta_1 + T(\sigma_{xa}/\sigma_x^2)$.

(iii) If the explanatory variables are not very highly correlated across time, what does part (ii) 
suggest about whether the inconsistency in the between estimator is smaller when there are 
more time periods? 


3 In a random effects model, define the composite error $v_{it} = a_i + u_{it}$, where $a_i$ is uncorrelated with $u_{it}$
and the $u_{it}$ have constant variance $\sigma_u^2$ and are serially uncorrelated. Define $e_{it} = v_{it} - \theta\bar{v}_i$, where $\theta$ is
given in (14.10).

(i) Show that $E(e_{it}) = 0$.

(ii) Show that $\mathrm{Var}(e_{it}) = \sigma_u^2$, $t = 1, \ldots, T$.

(iii) Show that for $t \neq s$, $\mathrm{Cov}(e_{it}, e_{is}) = 0$.


4 In order to determine the effects of collegiate athletic performance on applicants, you collect data on
applications for a sample of Division I colleges for 1985, 1990, and 1995.

(i) What measures of athletic success would you include in an equation? What are some of the
timing issues?

(ii) What other factors might you control for in the equation? 

(iii) Write an equation that allows you to estimate the effects of athletic success on the percentage change 
in applications. How would you estimate this equation? Why would you choose this method? 


5 Suppose that, for one semester, you can collect the following data on a random sample of college
juniors and seniors for each class taken: a standardized final exam score, percentage of lectures
attended, a dummy variable indicating whether the class is within the student’s major, cumulative
grade point average prior to the start of the semester, and SAT score.

(i) Why would you classify this data set as a cluster sample? Roughly, how many observations
would you expect for the typical student?

(ii) Write a model, similar to equation (14.28), that explains final exam performance in terms of
attendance and the other characteristics. Use s to subscript student and c to subscript class.
Which variables do not change within a student?

(iii) If you pool all of the data and use OLS, what are you assuming about unobserved student
characteristics that affect performance and attendance rate? What roles do SAT score and prior
GPA play in this regard?

(iv) If you think SAT score and prior GPA do not adequately capture student ability, how would you 
estimate the effect of attendance on final exam performance? 


6 Using the “cluster” option in the econometrics package Stata® 11, the fully robust standard errors
for the pooled OLS estimates in Table 14.2 (that is, robust to serial correlation and heteroskedasticity
in the composite errors, $\{v_{it}: t = 1, \ldots, T\}$) are obtained as $se(\hat{\beta}_{educ}) = .011$, $se(\hat{\beta}_{black}) = .051$,
$se(\hat{\beta}_{hispan}) = .039$, $se(\hat{\beta}_{exper}) = .020$, $se(\hat{\beta}_{exper^2}) = .0010$, $se(\hat{\beta}_{married}) = .026$, and $se(\hat{\beta}_{union}) = .027$.

(i) How do these standard errors generally compare with the nonrobust ones, and why?

(ii) How do the robust standard errors for pooled OLS compare with the standard errors for
RE? Does it seem to matter whether the explanatory variable is time-constant or time-varying?

(iii) When the fully robust standard errors for the RE estimates are computed, Stata® 11 reports
the following (where we look at only the coefficients on the time-varying variables):
$se(\hat{\beta}_{exper}) = .016$, $se(\hat{\beta}_{exper^2}) = .0008$, $se(\hat{\beta}_{married}) = .019$, and $se(\hat{\beta}_{union}) = .021$. [These
are robust to any kind of serial correlation or heteroskedasticity in the idiosyncratic errors
$\{u_{it}: t = 1, \ldots, T\}$ as well as heteroskedasticity in $a_i$.] How do the robust standard errors
generally compare with the usual RE standard errors reported in Table 14.2? What conclusion
might you draw?

(iv) Comparing the four standard errors in part (iii) with their pooled OLS counterparts, what do you 
make of the fact that the robust RE standard errors are all below the robust pooled OLS standard 
errors? 


7 The data in CENSUS2000 are a random sample of individuals from the United States. Here we are
interested in estimating a simple regression model relating the log of weekly income, lweekinc, to
schooling, educ. There are 29,501 observations. Associated with each individual is a state identifier
(state) for the 50 states plus the District of Columbia. A less coarse geographic identifier is puma,
which takes on 610 different values indicating geographic regions smaller than a state.

Running the simple regression of lweekinc on educ gives a slope coefficient equal to .1083 (to
four decimal places). The heteroskedasticity-robust standard error is about .0024. The standard error 
clustered at the puma level is about .0027, and the standard error clustered at the state level is about 
.0033. For computing a confidence interval, which of these standard errors is the most reliable? 
Explain. 


8 Consider the unobserved effects panel data model for a random draw i, where $\alpha_t$ denotes
different year intercepts:


Yu = d2 + +++ + adT + Bixi + Borin + BX + Vien + YZ + VX2 
+ YaXinZ + a) + up, t = 1,2,...,T 
Assuming a balanced panel, write down, for a given i, the CRE equation that you would estimate
using the RE estimator. Which parameter estimates should be identical to the fixed effects
estimates?

Computer Exercises

C1 Use the data in RENTAL for this exercise. The data on rental prices and other variables for college
towns are for the years 1980 and 1990. The idea is to see whether a stronger presence of students
affects rental rates. The unobserved effects model is


$$\log(rent_{it}) = \beta_0 + \delta_0 y90_t + \beta_1 \log(pop_{it}) + \beta_2 \log(avginc_{it}) + \beta_3 pctstu_{it} + a_i + u_{it},$$

where pop is city population, avginc is average income, and pctstu is student population as a percentage
of city population (during the school year).

(i) Estimate the equation by pooled OLS and report the results in standard form. What do you
make of the estimate on the 1990 dummy variable? What do you get for $\hat{\beta}_{pctstu}$?

(ii) Are the standard errors you report in part (i) valid? Explain.

(iii) Now, difference the equation and estimate by OLS. Compare your estimate of $\beta_{pctstu}$ with that
from part (i). Does the relative size of the student population appear to affect rental prices?

(iv) Estimate the model by fixed effects to verify that you get identical estimates and standard errors 
to those in part (iii). 


C2 Use CRIME4 for this exercise.

(i) Reestimate the unobserved effects model for crime in Example 13.9 but use fixed effects rather
than differencing. Are there any notable sign or magnitude changes in the coefficients? What 
about statistical significance? 

(ii) Add the logs of each wage variable in the data set and estimate the model by fixed effects. How 
does including these variables affect the coefficients on the criminal justice variables in part (i)? 

(iii) Do the wage variables in part (ii) all have the expected sign? Explain. Are they jointly significant? 


C3 For this exercise, we use JTRAIN to determine the effect of the job training grant on hours of job training
per employee. The basic model for the three years is

$$hrsemp_{it} = \beta_0 + \delta_1 d88_t + \delta_2 d89_t + \beta_1 grant_{it} + \beta_2 grant_{i,t-1} + \beta_3 \log(employ_{it}) + a_i + u_{it}.$$

(i) Estimate the equation using fixed effects. How many firms are used in the FE estimation? How
many total observations would be used if each firm had data on all variables (in particular,
hrsemp) for all three years?

(ii) Interpret the coefficient on grant and comment on its significance.

(iii) Is it surprising that $grant_{-1}$ is insignificant? Explain.

(iv) Do larger firms provide their employees with more or less training, on average? How big are 
the differences? (For example, if a firm has 10% more employees, what is the change in average 
hours of training?) 


C4 In Example 13.8, we used the unemployment claims data from Papke (1994) to estimate the effect of
enterprise zones on unemployment claims. Papke also uses a model that allows each city to have its
own time trend:

$$\log(uclms_{it}) = a_i + c_i t + \beta_1 ez_{it} + u_{it},$$

where $a_i$ and $c_i$ are both unobserved effects. This allows for more heterogeneity across cities.

(i) Show that, when the previous equation is first differenced, we obtain

$$\Delta\log(uclms_{it}) = c_i + \beta_1 \Delta ez_{it} + \Delta u_{it}, \quad t = 2, \ldots, T.$$

Notice that the differenced equation contains a fixed effect, $c_i$.

(ii) Estimate the differenced equation by fixed effects. What is the estimate of $\beta_1$? Is it very
different from the estimate obtained in Example 13.8? Is the effect of enterprise zones still
statistically significant?

(iii) Add a full set of year dummies to the estimation in part (ii). What happens to the estimate
of $\beta_1$?


C5 (i) In the wage equation in Example 14.4, explain why dummy variables for occupation might be
important omitted variables for estimating the union wage premium.

(ii) If every man in the sample stayed in the same occupation from 1981 through 1987, would you
need to include the occupation dummies in a fixed effects estimation? Explain.

(iii) Using the data in WAGEPAN, include eight of the occupation dummy variables in the equation
and estimate the equation using fixed effects. Does the coefficient on union change by much? 
What about its statistical significance? 


C6 Add the interaction term $union_{it} \cdot t$ to the equation estimated in Table 14.2 to see if wage growth depends
on union status. Estimate the equation by random and fixed effects and compare the results.


C7 Use the state-level data on murder rates and executions in MURDER for the following exercise.

(i) Consider the unobserved effects model

$$mrdrte_{it} = \eta_t + \beta_1 exec_{it} + \beta_2 unem_{it} + a_i + u_{it},$$

where $\eta_t$ simply denotes different year intercepts and $a_i$ is the unobserved state effect. If past
executions of convicted murderers have a deterrent effect, what should be the sign of $\beta_1$? What
sign do you think $\beta_2$ should have? Explain.

(ii) Using just the years 1990 and 1993, estimate the equation from part (i) by pooled OLS. Ignore 
the serial correlation problem in the composite errors. Do you find any evidence for a deterrent 
effect? 

(iii) Now, using 1990 and 1993, estimate the equation by fixed effects. You may use first 
differencing because you are only using two years of data. Is there evidence of a deterrent 
effect? How strong? 

(iv) Compute the heteroskedasticity-robust standard error for the estimation in part (ii). 

(v) Find the state that has the largest number for the execution variable in 1993. (The variable 
exec is total executions in 1991, 1992, and 1993.) How much bigger is this value than the next 
highest value? 

(vi) Estimate the equation using first differencing, dropping Texas from the analysis. Compute the 
usual and heteroskedasticity-robust standard errors. Now, what do you find? What is going on? 

(vii) Use all three years of data and estimate the model by fixed effects. Include Texas in the 
analysis. Discuss the size and statistical significance of the deterrent effect compared with only 
using 1990 and 1993. 


C8 Use the data in MATHPNL for this exercise. You will do a fixed effects version of the first differencing
done in Computer Exercise 11 in Chapter 13. The model of interest is

$$math4_{it} = \delta_1 y94_t + \cdots + \delta_5 y98_t + \gamma_1 \log(rexpp_{it}) + \gamma_2 \log(rexpp_{i,t-1}) + \psi_1 \log(enrol_{it}) + \psi_2 lunch_{it} + a_i + u_{it},$$

where the first available year (the base year) is 1993 because of the lagged spending variable.

(i) Estimate the model by pooled OLS and report the usual standard errors. You should include an
intercept along with the year dummies to allow $a_i$ to have a nonzero expected value. What are
the estimated effects of the spending variables? Obtain the OLS residuals, $\hat{v}_{it}$.

(ii) Is the sign of the $lunch_{it}$ coefficient what you expected? Interpret the magnitude of the
coefficient. Would you say that the district poverty rate has a big effect on test pass rates?

(iii) Compute a test for AR(1) serial correlation using the regression $\hat{v}_{it}$ on $\hat{v}_{i,t-1}$. You should use the
years 1994 through 1998 in the regression. Verify that there is strong positive serial correlation
and discuss why.

(iv) Now, estimate the equation by fixed effects. Is the lagged spending variable still significant?

(v) Why do you think, in the fixed effects estimation, the enrollment and lunch program variables
are jointly insignificant?

(vi) Define the total, or long-run, effect of spending as $\theta_1 = \gamma_1 + \gamma_2$. Use the substitution
$\gamma_1 = \theta_1 - \gamma_2$ to obtain a standard error for $\hat{\theta}_1$. [Hint: Standard fixed effects estimation using
$\log(rexpp_{it})$ and $z_{it} = \log(rexpp_{i,t-1}) - \log(rexpp_{it})$ as explanatory variables should do it.]


C9 The file PENSION contains information on participant-directed pension plans for U.S. workers. Some
of the observations are for couples within the same family, so this data set constitutes a small cluster
sample (with cluster sizes of two).

(i) Ignoring the clustering by family, use OLS to estimate the model

$$pctstck = \beta_0 + \beta_1 choice + \beta_2 prftshr + \beta_3 female + \beta_4 age + \beta_5 educ + \beta_6 finc25 + \beta_7 finc35 + \beta_8 finc50 + \beta_9 finc75 + \beta_{10} finc100 + \beta_{11} finc101 + \beta_{12} wealth89 + \beta_{13} stckin89 + \beta_{14} irain89 + u,$$

where the variables are defined in the data set. The variable of most interest is choice, which
is a dummy variable equal to one if the worker has a choice in how to allocate pension
funds among different investments. What is the estimated effect of choice? Is it statistically
significant?

(ii) Are the income, wealth, stock holding, and IRA holding control variables important? Explain.

(iii) Determine how many different families there are in the data set.
(iv) Now, obtain the standard errors for OLS that are robust to cluster correlation within a family. 
Do they differ much from the usual OLS standard errors? Are you surprised? 
(v) Estimate the equation by differencing across only the spouses within a family. Why do the 
explanatory variables asked about in part (ii) drop out in the first-differenced estimation? 
(vi) Are any of the remaining explanatory variables in part (v) significant? Are you surprised? 


C10 Use the data in AIRFARE for this exercise. We are interested in estimating the model

$$\log(fare_{it}) = \eta_t + \beta_1 concen_{it} + \beta_2 \log(dist_i) + \beta_3 [\log(dist_i)]^2 + a_i + u_{it}, \quad t = 1, \ldots, 4,$$

where $\eta_t$ means that we allow for different year intercepts.

(i) Estimate the above equation by pooled OLS, being sure to include year dummies. If
$\Delta concen = .10$, what is the estimated percentage increase in fare?

(ii) What is the usual OLS 95% confidence interval for $\beta_1$? Why is it probably not reliable? If you
have access to a statistical package that computes fully robust standard errors, find the fully
robust 95% CI for $\beta_1$. Compare it to the usual CI and comment.

(iii) Describe what is happening with the quadratic in log(dist). In particular, for what value of dist does 
the relationship between log(fare) and dist become positive? [Hint: First figure out the turning point 
value for log(dist), and then exponentiate.] Is the turning point outside the range of the data? 

(iv) Now estimate the equation using random effects. How does the estimate of $\beta_1$ change?

(v) Now estimate the equation using fixed effects. What is the FE estimate of $\beta_1$? Why is it fairly
similar to the RE estimate? (Hint: What is $\hat{\theta}$ for RE estimation?)

(vi) Name two characteristics of a route (other than distance between stops) that are captured by $a_i$.
Might these be correlated with $concen_{it}$?

(vii) Are you convinced that higher concentration on a route increases airfares? What is your best estimate?

C11 This question assumes that you have access to a statistical package that computes standard errors
robust to arbitrary serial correlation and heteroskedasticity for panel data methods.

(i) For the pooled OLS estimates in Table 14.1, obtain the standard errors that allow for arbitrary
serial correlation (in the composite errors, $v_{it} = a_i + u_{it}$) and heteroskedasticity. How do the
robust standard errors for educ, married, and union compare with the nonrobust ones?

(ii) Now obtain the robust standard errors for the fixed effects estimates that allow arbitrary serial
correlation and heteroskedasticity in the idiosyncratic errors, $u_{it}$. How do these compare with the
nonrobust FE standard errors?

(iii) For which method, pooled OLS or FE, is adjusting the standard errors for serial correlation 
more important? Why? 


C12 Use the data in ELEM94_95 to answer this question. The data are on elementary schools in
Michigan. In this exercise, we view the data as a cluster sample, where each school is part of a
district cluster.

(i) What are the smallest and largest number of schools in a district? What is the average number
of schools per district?

(ii) Using pooled OLS (that is, pooling across all 1,848 schools), estimate a model relating lavgsal
to bs, lenrol, lstaff, and lunch; see also Computer Exercise 11 from Chapter 9. What are the
coefficient and standard error on bs?

(iii) Obtain the standard errors that are robust to cluster correlation within district (and also
heteroskedasticity). What happens to the t statistic for bs?

(iv) Still using pooled OLS, drop the four observations with bs > .5 and obtain $\hat{\beta}_{bs}$ and its
cluster-robust standard error. Now is there much evidence for a salary-benefits tradeoff?

(v) Estimate the equation by fixed effects, allowing for a common district effect for schools within 
a district. Again drop the observations with bs > .5. Now what do you conclude about the 
salary-benefits tradeoff? 

(vi) In light of your estimates from parts (iv) and (v), discuss the importance of allowing teacher 
compensation to vary systematically across districts via a district fixed effect. 


C13 The data set DRIVING includes state-level panel data (for the 48 continental U.S. states) from 1980
through 2004, for a total of 25 years. Various driving laws are indicated in the data set, including the
alcohol level at which drivers are considered legally intoxicated. There are also indicators for “per se”
laws (where licenses can be revoked without a trial) and seat belt laws. Some economic and demographic
variables are also included.

(i) How is the variable totfatrte defined? What is the average of this variable in the years 1980,
1992, and 2004? Run a regression of totfatrte on dummy variables for the years 1981 through
2004, and describe what you find. Did driving become safer over this period? Explain.

(ii) Add the variables bac08, bac10, perse, sbprim, sbsecon, sl70plus, gdl, perc14_24, unem, and
vehicmilespc to the regression from part (i). Interpret the coefficients on bac08 and bac10. Do
per se laws have a negative effect on the fatality rate? What about having a primary seat belt 
law? (Note that if a law was enacted sometime within a year the fraction of the year is recorded 
in place of the zero-one indicator.) 

(iii) Reestimate the model from part (ii) using fixed effects (at the state level). How do the 
coefficients on bac08, bac10, perse, and sbprim compare with the pooled OLS estimates? 
Which set of estimates do you think is more reliable? 

(iv) Suppose that vehicmilespc, the number of miles driven per capita, increases by 1,000. Using
the FE estimates, what is the estimated effect on totfatrte? Be sure to interpret the estimate as if
explaining to a layperson.

(v) If there is serial correlation or heteroskedasticity in the idiosyncratic errors of the model, then the
standard errors in part (iii) are invalid. If possible, use “cluster” robust standard errors for the fixed
effects estimates. What happens to the statistical significance of the policy variables in part (iii)?

C14 Use the data set in AIRFARE to answer this question. The estimates can be compared with those in
Computer Exercise 10 in this chapter.

(i) Compute the time averages of the variable concen; call these concenbar. How many different
time averages can there be? Report the smallest and the largest.

(ii) Estimate the equation

$$lfare_{it} = \beta_0 + \delta_1 y98_t + \delta_2 y99_t + \delta_3 y00_t + \beta_1 concen_{it} + \beta_2 ldist_i + \beta_3 ldistsq_i + \gamma_1 concenbar_i + a_i + u_{it}$$

by random effects. Verify that $\hat{\beta}_1$ is identical to the FE estimate computed in C10.

(iii) If you drop ldist and ldistsq from the estimation in part (ii) but still include $concenbar_i$, what
happens to the estimate of $\beta_1$? What happens to the estimate of $\gamma_1$?

(iv) Using the equation in part (ii) and the usual RE standard error, test $H_0: \gamma_1 = 0$ against the
two-sided alternative. Report the p-value. What do you conclude about RE versus FE for estimating
$\beta_1$ in this application?

(v) If possible, for the test in part (iv) obtain a t-statistic (and, therefore, p-value) that is robust to 
arbitrary serial correlation and heteroskedasticity. Does this change the conclusion reached in 
part (iv)? 

C15 Use the data in COUNTYMURDERS to answer this question. The data set covers murders and executions
(capital punishment) for 2,197 counties in the United States. See also Computer Exercise C16 in
Chapter 13.

(i) Consider the model

$$murdrate_{it} = \theta_t + \delta_0 execs_{it} + \delta_1 execs_{i,t-1} + \delta_2 execs_{i,t-2} + \delta_3 execs_{i,t-3} + \beta_5 percblack_{it} + \beta_6 percmale_{it} + \beta_7 perc1019_{it} + \beta_8 perc2029_{it} + a_i + u_{it},$$

where $\theta_t$ represents a different intercept for each time period, $a_i$ is the county fixed effect, and $u_{it}$
is the idiosyncratic error. Why does it make sense to include lags of the key variable, execs, in
the equation?

(ii) Apply OLS to the equation from part (i) and report the estimates of $\delta_0$, $\delta_1$, $\delta_2$, and $\delta_3$, along with
the usual pooled OLS standard errors. Do you estimate that executions have a deterrent effect
on murders? Provide an explanation that involves $a_i$.

(iii) Now estimate the equation in part (i) using fixed effects to remove $a_i$. What are the new
estimates of the $\delta_j$? Are they very different from the estimates from part (ii)?

(iv) Obtain the long-run propensity from estimates in part (iii). Using the usual FE standard errors,
is the LRP statistically different from zero?

(v) If possible, obtain the standard errors for the FE estimates that are robust to arbitrary
heteroskedasticity and serial correlation in the $\{u_{it}\}$. What happens to the statistical significance
of the $\hat{\delta}_j$? What about the estimated LRP?


C16 Use the data in WAGEPAN.DTA to answer the following questions.

(i) Using lwage as the dependent variable, estimate a model that only contains an intercept and
the year dummies d81 through d87. Use pooled OLS, RE, FE, and FD (where in the latter case
you difference the year dummies, along with lwage, and omit an overall constant in the FD
regression). What do you conclude about the coefficients on the year dummies?

(ii) Add the time-constant variables educ, black, and hisp to the model, and estimate it by
OLS and RE. How do the coefficients compare? What happens if you estimate the equation
by FE?

(iii) What do you conclude about the four estimation methods when the model includes only
variables that change just across t or just across i?

(iv) Now estimate the equation

$$lwage_{it} = \alpha_t + \beta_1 union_{it} + \beta_2 married_{it} + \beta_3 educ_i + \beta_4 black_i + \beta_5 hisp_i + c_i + u_{it}$$

by random effects. Do the coefficients seem reasonable? How do the nonrobust and cluster-robust
standard errors compare?

(v) Now estimate the equation

$$lwage_{it} = \alpha_t + \beta_1 union_{it} + \beta_2 married_{it} + c_i + u_{it}$$

by fixed effects, being sure to include the full set of time dummies to reflect the different
intercepts. How do the estimates of $\beta_1$ and $\beta_2$ compare with those in part (iv)? Compute the
usual FE standard errors and the cluster-robust standard errors. How do they compare?

(vi) Obtain the time averages $\overline{union}_i$ and $\overline{married}_i$. Along with educ, black, and hisp, add these to the
equation from part (iv). Verify that the CRE estimates of $\beta_1$ and $\beta_2$ are identical to the FE estimates.

(vii) Obtain the robust, variable addition Hausman test. What do you conclude about RE versus FE?

(viii) Let educ have an interactive effect with both union and married and estimate the model by
fixed effects. Are the interactions individually or jointly significant? Why are the coefficients on
union and married now imprecisely estimated?

(ix) Estimate the average partial effects of union and married for the model estimated in part (viii).
How do these compare with the FE estimates from part (v)?

(x) Verify that for the model in part (viii) the CRE estimates are the same as the FE estimates
when they should be. (Hint: You need to include $\overline{union}_i$, $\overline{married}_i$, $educ_i \cdot \overline{union}_i$, and
$educ_i \cdot \overline{married}_i$ in the CRE estimation.)


C17 Use the data in SCHOOL93_98 to answer the following questions. Use the command xtset schid year
to set the cross section and time dimensions.

(i) How many schools are there? Does each school have a record for each of the six years? Verify
that lavgrexpp is missing for all schools in 1993.

(ii) Create a selection indicator, s, that is equal to one if and only if you have nonmissing data on
math4, lavgrexpp, lunch, and lenrol. Next, define a variable tobs to be the number of complete
time periods per school. How many schools have all given years of available data (noting that
1993 is not available for any school when we use lavgrexpp)? Drop all schools with tobs = 0.

(iii) Use random effects to estimate a model relating math4 to lavgrexpp, lunch, and lenrol. Be sure
to include a full set of year dummies. What is the estimated effect of school spending on math4?
What is its cluster-robust t statistic?

(iv) Now estimate the model from part (iii) by fixed effects. What is the estimated spending effect 
and its robust confidence interval? How does it compare to the RE estimate from part (iii)? 

(v) Create the time averages of all of the explanatory variables in the RE/FE estimation, including the 
time dummies. You need to use the selection indicator constructed in part (ii). Verify that when 
you add these and estimate the equation by RE you obtain the FE estimates on the time-varying 
explanatory variables. What happens if you drop the time averages for y95, y96, y97, and y98? 

(vi) Is the random effects estimator rejected in favor of fixed effects? Explain. 


APPENDIX 14A 


14A.1 Assumptions for Fixed and Random Effects 


In this appendix, we provide statements of the assumptions for fixed and random effects estimation. 
We also provide a discussion of the properties of the estimators under different sets of assumptions. 
Verification of these claims is somewhat involved, but can be found in Wooldridge (2010, Chapter 10). 


Assumption FE.1
For each i, the model is

$$y_{it} = \beta_1 x_{it1} + \cdots + \beta_k x_{itk} + a_i + u_{it}, \quad t = 1, \ldots, T,$$

where the $\beta_j$ are the parameters to estimate and $a_i$ is the unobserved effect.


Assumption FE.2


We have a random sample from the cross section.


Assumption FE.3 


Each explanatory variable changes over time (for at least some i), and no perfect linear relationships 
exist among the explanatory variables. 


Assumption FE.4 


For each t, the expected value of the idiosyncratic error given the explanatory variables in all time
periods and the unobserved effect is zero: $E(u_{it} | X_i, a_i) = 0$.


Under these first four assumptions, which are identical to the assumptions for the first-differencing
estimator, the fixed effects estimator is unbiased. Again, the key is the strict exogeneity
assumption, FE.4. Under these same assumptions, the FE estimator is consistent with a fixed T
as $N \to \infty$.


Assumption FE.5 
$\mathrm{Var}(u_{it} | X_i, a_i) = \mathrm{Var}(u_{it}) = \sigma_u^2$, for all $t = 1, \ldots, T$.


Assumption FE.6 


For all $t \neq s$, the idiosyncratic errors are uncorrelated (conditional on all explanatory variables
and $a_i$): $\mathrm{Cov}(u_{it}, u_{is} | X_i, a_i) = 0$.


Under Assumptions FE.1 through FE.6, the fixed effects estimator of the $\beta_j$ is the best linear
unbiased estimator. Because the FD estimator is linear and unbiased, it is necessarily worse than the FE
estimator. The assumption that makes FE better than FD is FE.6, which implies that the idiosyncratic
errors are serially uncorrelated.
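
One way to see what FE.6 implies for first differencing is a small simulation, offered here as an illustrative sketch and not as part of the formal argument. When the $u_{it}$ are serially uncorrelated with constant variance, adjacent differences share the term $u_{it}$ with opposite signs, so their correlation is $-.5$ (the result stated in Problem 1). The standard normal draws and the helper name `corr` are assumptions made only for this illustration.

```python
import random

def corr(a, b):
    """Sample correlation between two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    va = sum((p - ma) ** 2 for p in a)
    vb = sum((q - mb) ** 2 for q in b)
    return cov / (va * vb) ** 0.5

# Serially uncorrelated errors with constant variance for T = 3 periods per unit.
random.seed(1)
d2, d3 = [], []
for _ in range(100_000):
    u1, u2, u3 = (random.gauss(0, 1) for _ in range(3))
    d2.append(u2 - u1)   # first difference in period 2
    d3.append(u3 - u2)   # first difference in period 3

rho = corr(d2, d3)
print(rho)  # should be close to -0.5
```

Because the differenced errors are serially correlated under FE.6, OLS on the differenced data is not the efficient estimator in that case, which is consistent with FE being preferred under these assumptions.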


Assumption FE.7 
Conditional on $X_i$ and $a_i$, the $u_{it}$ are independent and identically distributed as Normal$(0, \sigma_u^2)$.


Assumption FE.7 implies FE.4, FE.5, and FE.6, but it is stronger because it assumes a normal distribution
for the idiosyncratic errors. If we add FE.7, the FE estimator is normally distributed, and t
and F statistics have exact t and F distributions. Without FE.7, we can rely on asymptotic approximations,
but, without making special assumptions, these approximations require large N and small T.

The ideal random effects assumptions include FE.1, FE.2, FE.4, FE.5, and FE.6. (FE.7 could be
added but it gains us little in practice because we have to estimate $\theta$.) Because we are only subtracting
a fraction of the time averages, we can now allow time-constant explanatory variables. So, FE.3
is replaced with the following assumption:


Assumption RE.1 


There are no perfect linear relationships among the explanatory variables.

The cost of allowing time-constant regressors is that we must add assumptions about how the unobserved
effect, $a_i$, is related to the explanatory variables.


Assumption RE.2 


In addition to FE.4, the expected value of $a_i$, given all explanatory variables, is constant:
$E(a_i | X_i) = \beta_0$.

This is the assumption that rules out correlation between the unobserved effect and the explanatory
variables, and it is the key distinction between fixed effects and random effects. Because we are
assuming $a_i$ is uncorrelated with all elements of $x_{it}$, we can include time-constant explanatory
variables. (Technically, the quasi-time-demeaning only removes a fraction of the time average, and not
the whole time average.) We allow for a nonzero expectation for $a_i$ in stating Assumption RE.2 so
that the model under the random effects assumptions contains an intercept, $\beta_0$, as in equation (14.7).
Remember, we would typically include a set of time-period intercepts, too, with the first year acting
as the base year.

We also need to impose homoskedasticity on a; as follows: 


Assumption RE.3 
In addition to FE.5, the variance of a; given all explanatory variables, is constant: Var(a;|X;) = oZ. 


Under the six random effects assumptions (FE.1, FE.2, RE.3, RE.4, RE.5, and FE.6), the RE 
estimator is consistent and asymptotically normally distributed as N gets large for fixed T. Actu- 
ally, consistency and asymptotic normality follow under the first four assumptions, but without the 
last two assumptions the usual RE standard errors and test statistics would not be valid. In addition, 
under the six RE assumptions, the RE estimators are asymptotically efficient. This means that, in 
large samples, the RE estimators will have smaller standard errors than the corresponding pooled 
OLS estimators (when the proper, robust standard errors are used for pooled OLS). For coefficients 
on time-varying explanatory variables (the only ones estimable by FE), the RE estimator is more 
efficient than the FE estimator—often much more efficient. But FE is not meant to be efficient under 
the RE assumptions; FE is intended to be robust to correlation between $a_i$ and the $x_{itj}$. As often happens
in econometrics, there is a tradeoff between robustness and efficiency. See Wooldridge (2010,
Chapter 10) for verification of the claims made here. 


14A.2 Inference Robust to Serial Correlation and Heteroskedasticity for 
Fixed Effects and Random Effects 


One of the key assumptions for performing inference using the FE, RE, and even the CRE 
approach to panel data models is the assumption of no serial correlation in the idiosyncratic errors, 
$\{u_{it}: t = 1, \ldots, T\}$ (see Assumption FE.6). Of course, heteroskedasticity can also be an issue, but
this is also ruled out for standard inference (see Assumption FE.5). As discussed in the appendix to
Chapter 13, the same issues can arise with first differencing estimation when we have $T \geq 3$ time
periods.

Fortunately, as with FD estimation, there are now simple solutions for fully robust inference— 
inference that is robust to arbitrary violations of Assumptions FE.5 and FE.6 and, when applying 
the RE or CRE approaches, to Assumption RE.5. As with FD estimation, the general approach to 
obtaining fully robust standard errors and test statistics is known as clustering. Now, however, the 
clustering is applied to a different equation. For example, for FE estimation, the clustering is applied 
to the time-demeaned equation (14.5). For RE estimation, the clustering gets applied to the quasi- 
time-demeaned equation (14.11) (and a similar comment holds for CRE, but there the time aver- 
ages are included as separate explanatory variables). The details, which can be found in Wooldridge 
(2010, Chapter 10) are too advanced for this text. But understanding the purpose of clustering is 
not: if possible, we should compute standard errors, confidence intervals, and test statistics that are 
valid in large cross sections under the weakest set of assumptions. The FE estimator requires only
Assumptions FE.1 to FE.4 for unbiasedness and consistency (as $N \to \infty$ with T fixed). Thus, a careful
researcher at least checks whether inference made robust to serial correlation and heteroskedasticity
in the errors affects inference. Experience shows that it often does.

Applying cluster-robust inference to account for serial correlation within a panel data context
is easily justified when N is substantially larger than T. Under certain restrictions on the time series 
dependence, of the sort discussed in Chapter 11, cluster-robust inference for the fixed effects estima- 
tor can be justified when T is of a similar magnitude as N, provided both are not small. This follows 
from the work by Hansen (2007). Generally, clustering is not theoretically justified when N is small 
and T is large. 

Computing the cluster-robust statistics after FE or RE estimation is simple in many econometrics
packages, often only requiring an option of the form “cluster(id)” appended to the end of FE and
RE estimation commands. As in the FD case, “id” refers to a cross-section identifier. Similar comments
hold when applying FE or RE to cluster samples, with “id” then serving as the cluster identifier.
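
As a rough illustration of what such a cluster option computes in the fixed effects case, the following sketch time-demeans a simulated panel and forms a cluster-robust (sandwich) variance by summing the scores within each cross-sectional unit before squaring. The data-generating process, the variable names, and the panel dimensions are all assumptions made for illustration, and the finite-sample adjustments that packages typically apply are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, beta = 500, 5, 1.0          # assumed panel dimensions and true slope

# Simulated panel: a_i is correlated with x, and u_it follows an AR(1),
# so Assumption FE.6 (no serial correlation) fails by construction.
a = rng.normal(size=(N, 1))
x = 0.5 * a + rng.normal(size=(N, T)).cumsum(axis=1)
u = np.zeros((N, T))
u[:, 0] = rng.normal(size=N)
for t in range(1, T):
    u[:, t] = 0.8 * u[:, t - 1] + rng.normal(size=N)
y = beta * x + a + u

# Time-demean within each unit (the FE transformation).
xd = x - x.mean(axis=1, keepdims=True)
yd = y - y.mean(axis=1, keepdims=True)

# FE slope from pooled OLS on the time-demeaned data.
sxx = (xd ** 2).sum()
b_fe = (xd * yd).sum() / sxx
res = yd - b_fe * xd

# Usual FE standard error (valid under FE.5 and FE.6), df = N(T-1) - 1.
sigma2 = (res ** 2).sum() / (N * (T - 1) - 1)
se_usual = np.sqrt(sigma2 / sxx)

# Cluster-robust standard error: sum the scores within each unit first,
# which allows arbitrary correlation across time within a unit.
score_i = (xd * res).sum(axis=1)
se_cluster = np.sqrt((score_i ** 2).sum()) / sxx

print(f"b_fe={b_fe:.3f}  se_usual={se_usual:.4f}  se_cluster={se_cluster:.4f}")
```

Comparing the two printed standard errors is the kind of check a careful researcher would make; with serially correlated errors they need not agree.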


Instrumental Variables Estimation and Two Stage Least Squares


In this chapter, we further study the problem of endogenous explanatory variables in multiple
regression models. In Chapter 3, we derived the bias in the OLS estimators when an important
variable is omitted; in Chapter 5, we showed that OLS is generally inconsistent under omitted
variables. Chapter 9 demonstrated that omitted variables bias can be eliminated (or at least mitigated) 
when a suitable proxy variable is given for an unobserved explanatory variable. Unfortunately, 
suitable proxy variables are not always available. 

In the previous two chapters, we explained how fixed effects estimation or first differencing can 
be used with panel data to estimate the effects of time-varying independent variables in the presence 
of time-constant omitted variables. Although such methods are very useful, we do not always have 
access to panel data. Even if we can obtain panel data, it does us little good if we are interested in the 
effect of a variable that does not change over time: first differencing or fixed effects estimation elimi- 
nates time-constant explanatory variables. In addition, the panel data methods that we have studied so 
far do not solve the problem of time-varying omitted variables that are correlated with the explanatory 
variables. 

In this chapter, we take a different approach to the endogeneity problem. You will see how the 
method of instrumental variables (IV) can be used to solve the problem of endogeneity of one or more 
explanatory variables. The method of two stage least squares (2SLS or TSLS) is second in popularity 
only to ordinary least squares for estimating linear equations in applied econometrics. 

We begin by showing how IV methods can be used to obtain consistent estimators in the presence
of omitted variables. IV can also be used to solve the errors-in-variables problem, at least under
certain assumptions. Chapter 16 will demonstrate how to estimate simultaneous equations models
using IV methods. 

Our treatment of instrumental variables estimation closely follows our development of ordinary 
least squares in Part 1, where we assumed that we had a random sample from an underlying popula- 
tion. This is a desirable starting point because, in addition to simplifying the notation, it emphasizes 
that the important assumptions for IV estimation are stated in terms of the underlying population (just 
as with OLS). As we showed in Part 2, OLS can be applied to time series data, and the same is true of 
instrumental variables methods. Section 15-7 discusses some special issues that arise when IV methods
are applied to time series data. In Section 15-8, we cover applications to pooled cross sections and
panel data.


15-1 Motivation: Omitted Variables in a Simple Regression Model 


When faced with the prospect of omitted variables bias (or unobserved heterogeneity), we have so 
far discussed three options: (1) we can ignore the problem and suffer the consequences of biased and 
inconsistent estimators; (2) we can try to find and use a suitable proxy variable for the unobserved 
variable; or (3) we can assume that the omitted variable does not change over time and use the fixed 
effects or first-differencing methods from Chapters 13 and 14. The first response can be satisfactory 
if the estimates are coupled with the direction of the biases for the key parameters. For example, if we 
can say that the estimator of a positive parameter, say, the effect of job training on subsequent wages, 
is biased toward zero and we have found a statistically significant positive estimate, we have still 
learned something: job training has a positive effect on wages, and it is likely that we have underes- 
timated the effect. Unfortunately, the opposite case, where our estimates may be too large in magni- 
tude, often occurs, which makes it very difficult for us to draw any useful conclusions. 

The proxy variable solution discussed in Section 9-2 can also produce satisfying results, but it is 
not always possible to find a good proxy. This approach attempts to solve the omitted variable prob- 
lem by replacing the unobservable with one or more proxy variables. 

Another approach leaves the unobserved variable in the error term, but rather than estimating the 
model by OLS, it uses an estimation method that recognizes the presence of the omitted variable. This 
is what the method of instrumental variables does. 

For illustration, consider the problem of unobserved ability in a wage equation for working 
adults. A simple model is 


$\log(wage) = \beta_0 + \beta_1 educ + \beta_2 abil + e,$


where e is the error term. In Chapter 9, we showed how, under certain assumptions, a proxy variable
such as IQ can be substituted for ability, and then a consistent estimator of $\beta_1$ is available from the
regression of


log(wage) on educ, IQ. 


Suppose, however, that a proxy variable is not available (or does not have the properties needed to 
produce a consistent estimator of $\beta_1$). Then, we put abil into the error term, and we are left with the
simple regression model


$\log(wage) = \beta_0 + \beta_1 educ + u,$ [15.1]


where u contains abil. Of course, if equation (15.1) is estimated by OLS, a biased and inconsistent
estimator of $\beta_1$ results if educ and abil are correlated.

It turns out that we can still use equation (15.1) as the basis for estimation, provided we can find
an instrumental variable for educ. To describe this approach, the simple regression model is written as


$y = \beta_0 + \beta_1 x + u,$ [15.2]

where we think that x and u are correlated (have nonzero covariance):

$\mathrm{Cov}(x,u) \neq 0.$ [15.3]


The method of instrumental variables works whether or not x and u are correlated, but, for reasons we 
will see later, OLS should be used if x is uncorrelated with u. 

In order to obtain consistent estimators of $\beta_0$ and $\beta_1$ when x and u are correlated, we need some
additional information. The information comes by way of a new variable that satisfies certain properties.
Suppose that we have an observable variable z that satisfies these two assumptions: (1) z is
uncorrelated with u, that is,


$\mathrm{Cov}(z,u) = 0;$ [15.4]

(2) z is correlated with x, that is,

$\mathrm{Cov}(z,x) \neq 0.$ [15.5]


Then, we call z an instrumental variable for x, or sometimes simply an instrument for x. 

The requirement that the instrument z satisfies (15.4) is summarized by saying “z is exogenous in 
equation (15.2),” and so we often refer to (15.4) as instrument exogeneity. In the context of omitted vari- 
ables, instrument exogeneity means that z should have no partial effect on y (after x and omitted variables 
have been controlled for), and z should be uncorrelated with the omitted variables. Equation (15.5) means 
that z must be related, either positively or negatively, to the endogenous explanatory variable x. This condi- 
tion is sometimes referred to as instrument relevance (as in “z is relevant for explaining variation in x”). 

There is a very important difference between the two requirements for an instrumental variable. 
Because (15.4) involves the covariance between z and the unobserved error u, we cannot generally 
hope to test this assumption: in most cases, we must maintain Cov(z,u) = 0 by appealing to economic 
behavior or introspection. Sometimes, we might have an observable proxy variable for some factor 
contained in u, in which case we can check to see if z and the proxy variable are roughly uncorrelated. 
Of course, if we have a good proxy for an important element of u, we might just add the proxy as an 
explanatory variable and estimate the expanded equation by ordinary least squares. See Section 9-2. 

Some readers may be wondering why we do not attempt to check (15.4) by using the following
procedure. Given a sample of size n, obtain the OLS residuals, $\hat{u}_i$, from the regression $y_i$ on $x_i$. Then,
devise a test based on the sample correlation between $z_i$ and $\hat{u}_i$ as a check on whether $z_i$ and the
unobserved errors $u_i$ are correlated. A moment’s thought reveals the logical problem with this procedure. The
entire reason for moving beyond OLS is that we think the OLS estimators of $\beta_0$ and $\beta_1$ are inconsistent
due to correlation between x and u. Therefore, in computing the OLS residuals $\hat{u}_i = y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i$,
we are not getting useful estimates of the $u_i$. Therefore, we can learn nothing by studying the correlation
between $z_i$ and $\hat{u}_i$. A related suggestion is to use the OLS regression $y_i$ on $x_i$, $z_i$ and to conclude $z_i$
satisfies the exogeneity requirement if its coefficient is statistically insignificant. Again, this procedure does
not work, regardless of the outcome of the test, because x is allowed to be endogenous. The bottom line
is that, in the current setting, we have no way of testing (15.4) unless we use external information.

By contrast, the condition that z is correlated with x (in the population) can be tested, given a 
random sample from the population. The easiest way to do this is to estimate a simple regression 
between x and z. In the population, we have 


$x = \pi_0 + \pi_1 z + v.$ [15.6]


Then, because $\pi_1 = \mathrm{Cov}(z,x)/\mathrm{Var}(z)$, assumption (15.5) holds if, and only if, $\pi_1 \neq 0$. Thus, we
should be able to reject the null hypothesis


$H_0: \pi_1 = 0$ [15.7]

against the two-sided alternative $H_1: \pi_1 \neq 0$, at a sufficiently small significance level. If this is the
case, then we can be fairly confident that (15.5) holds.
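
The relevance check in (15.6) and (15.7) amounts to a simple regression of x on z followed by a t test on the slope. A minimal sketch on simulated data follows; the data-generating process (including the assumed value of 0.5 for the slope on z) and all variable names are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

# Assumed data-generating process: z shifts x, with true pi_1 = 0.5.
z = rng.normal(size=n)
x = 2.0 + 0.5 * z + rng.normal(size=n)

# OLS of x on z, as in equation (15.6).
zc = z - z.mean()
sst_z = (zc ** 2).sum()
pi1_hat = (zc * (x - x.mean())).sum() / sst_z
pi0_hat = x.mean() - pi1_hat * z.mean()

# Usual OLS standard error and t statistic for H0: pi_1 = 0.
v_hat = x - pi0_hat - pi1_hat * z
se_pi1 = np.sqrt((v_hat ** 2).sum() / (n - 2) / sst_z)
t_stat = pi1_hat / se_pi1

print(f"pi1_hat={pi1_hat:.3f}  se={se_pi1:.3f}  t={t_stat:.2f}")
```

A large t statistic in absolute value leads to rejecting (15.7) in favor of relevance. Nothing in this calculation speaks to the exogeneity condition (15.4).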

For the log(wage) equation in (15.1), an instrumental variable z for educ must be (1) uncorrelated 
with ability (and any other unobserved factors affecting wage) and (2) correlated with education. 
Something such as the last digit of an individual’s Social Security Number almost certainly satisfies 
the first requirement: it is uncorrelated with ability because it is determined randomly. However, it is 
precisely because of the randomness of the last digit of the SSN that it is not correlated with educa- 
tion, either; therefore it makes a poor instrumental variable for educ because it violates the instrument 
relevance requirement in equation (15.5). 

What we have called a proxy variable for the omitted variable makes a poor IV for the opposite 
reason. For example, in the log(wage) example with omitted ability, a proxy variable for abil should 
be as highly correlated as possible with abil. An instrumental variable must be uncorrelated with abil. 
Therefore, while IQ is a good candidate as a proxy variable for abil, it is not a good instrumental variable
for educ because it violates the instrument exogeneity requirement in equation (15.4).

Whether other possible instrumental variable candidates satisfy the exogeneity requirement in 
(15.4) is less clear-cut. In wage equations, labor economists have used family background variables 
as IVs for education. For example, mother’s education (motheduc) is positively correlated with child’s 
education, as can be seen by collecting a sample of data on working people and running a simple 
regression of educ on motheduc. Therefore, motheduc satisfies equation (15.5). The problem is that 
mother’s education might also be correlated with child’s ability (through mother’s ability and perhaps 
quality of nurturing at an early age), in which case (15.4) fails. 

Another IV choice for educ in (15.1) is number of siblings while growing up (sibs). Typically, 
having more siblings is associated with lower average levels of education. Thus, if number of siblings 
is uncorrelated with ability, it can act as an instrumental variable for educ. 

As a second example, consider the problem of estimating the causal effect of skipping classes on 
final exam score. In a simple regression framework, we have 


$score = \beta_0 + \beta_1 skipped + u,$ [15.8]


where score is the final exam score and skipped is the total number of lectures missed during the 
semester. We certainly might be worried that skipped is correlated with other factors in u: more able, 
highly motivated students might miss fewer classes. Thus, a simple regression of score on skipped 
may not give us a good estimate of the causal effect of missing classes. 

What might be a good IV for skipped? We need something that has no direct effect on score and 
is not correlated with student ability and motivation. At the same time, the IV must be correlated 
with skipped. One option is to use distance between living quarters and classrooms. Especially at 
large universities, some living quarters will be further from a student’s classrooms, and this may 
essentially be a random occurrence. Some students live off campus while others commute long dis- 
tances. Living further away from classrooms may increase the likelihood of missing lectures due to 
bad weather, oversleeping, and so on. Thus, skipped may be positively correlated with distance; this 
can be checked by regressing skipped on distance and doing a t test, as described earlier. 

Is distance uncorrelated with u? In the simple regression model (15.8), some factors in u may be cor- 
related with distance. For example, students from low-income families may live off campus; if income 
affects student performance, this could cause distance to be correlated with u. Section 15-2 shows how 
to use IV in the context of multiple regression, so that other factors affecting score can be included 
directly in the model. Then, distance might be a good IV for skipped. An IV approach may not be neces- 
sary at all if a good proxy exists for student ability, such as cumulative GPA prior to the semester. 

There is a final point worth emphasizing before we turn to the mechanics of IV estimation: namely, 
in using the simple regression in equation (15.6) to test (15.7), it is important to take note of the sign 
(and even magnitude) of $\hat{\pi}_1$ and not just its statistical significance. Arguments for why a variable z
makes a good IV candidate for an endogenous explanatory variable x should include a discussion about
the nature of the relationship between x and z. For example, due to genetics and background influences
it makes sense that child’s education (x) and mother’s education (z) are positively correlated. If in your
sample of data you find that they are actually negatively correlated (that is, $\hat{\pi}_1 < 0$), then your use of
mother’s education as an IV for child’s education is likely to be unconvincing. [And this has nothing 
to do with whether condition (15.4) is likely to hold.] In the example of measuring whether skipping 
classes has an effect on test performance, one should find a positive, statistically significant relationship 
between skipped and distance in order to justify using distance as an IV for skipped: a negative rela- 
tionship would be difficult to justify [and would suggest that there are important omitted variables driv- 
ing a negative correlation—variables that might themselves have to be included in the model (15.8)]. 

We now demonstrate that the availability of an instrumental variable can be used to estimate 
consistently the parameters in equation (15.2). In particular, we show that assumptions (15.4) and 
(15.5) serve to identify the parameter $\beta_1$. Identification of a parameter in this context means that we
can write $\beta_1$ in terms of population moments that can be estimated using a sample of data. To write $\beta_1$
in terms of population covariances, we use equation (15.2): the covariance between z and y is


$\mathrm{Cov}(z,y) = \beta_1 \mathrm{Cov}(z,x) + \mathrm{Cov}(z,u).$


Now, under assumption (15.4), $\mathrm{Cov}(z,u) = 0$, and under assumption (15.5), $\mathrm{Cov}(z,x) \neq 0$. Thus, we
can solve for $\beta_1$ as


$\beta_1 = \dfrac{\mathrm{Cov}(z,y)}{\mathrm{Cov}(z,x)}.$ [15.9]


[Notice how this simple algebra fails if z and x are uncorrelated, that is, if $\mathrm{Cov}(z,x) = 0$.] Equation
(15.9) shows that $\beta_1$ is the population covariance between z and y divided by the population covariance
between z and x, which shows that $\beta_1$ is identified. Given a random sample, we estimate the
population quantities by the sample analogs. After canceling the sample sizes in the numerator and
denominator, we get the instrumental variables (IV) estimator of $\beta_1$:


$\hat{\beta}_1 = \dfrac{\sum_{i=1}^{n} (z_i - \bar{z})(y_i - \bar{y})}{\sum_{i=1}^{n} (z_i - \bar{z})(x_i - \bar{x})}.$ [15.10]


Given a sample of data on x, y, and z, it is simple to obtain the IV estimator in (15.10). The IV estimator
of $\beta_0$ is simply $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$, which looks just like the OLS intercept estimator except that the
slope estimator, $\hat{\beta}_1$, is now the IV estimator.

It is no accident that when z = x we obtain the OLS estimator of $\beta_1$. In other words,
when x is exogenous, it can be used as its own IV, and the IV estimator is then identical to the OLS 
estimator. 
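
The IV estimator in (15.10) takes only a few lines to compute. The sketch below applies it to simulated data in which an unobserved variable (called abil here) makes x endogenous, and then confirms that using x as its own instrument reproduces the OLS slope. The helper name iv_simple, the data-generating process, and the true slope of 0.5 are assumptions made for illustration.

```python
import numpy as np

def iv_simple(y, x, z):
    """IV estimates of (beta0, beta1) in y = beta0 + beta1*x + u, using z for x."""
    zc = z - z.mean()
    b1 = (zc * (y - y.mean())).sum() / (zc * (x - x.mean())).sum()
    b0 = y.mean() - b1 * x.mean()
    return b0, b1

rng = np.random.default_rng(2)
n = 5000

# Assumed design: abil is unobserved, affects y, and is correlated with x;
# z affects x but is unrelated to abil and to the other error component.
z = rng.normal(size=n)
abil = rng.normal(size=n)
x = 0.8 * z + 0.6 * abil + rng.normal(size=n)
y = 1.0 + 0.5 * x + abil + rng.normal(size=n)   # true beta1 = 0.5

b0_iv, b1_iv = iv_simple(y, x, z)
b0_ols, b1_ols = iv_simple(y, x, x)   # x as its own instrument gives OLS

print(f"IV slope:  {b1_iv:.3f}")
print(f"OLS slope: {b1_ols:.3f}")
```

In this design the OLS slope converges to 0.5 + 0.6/2 = 0.8 rather than 0.5, because Cov(x, abil) = 0.6 and Var(x) = 2, while the IV slope should be close to 0.5 in a sample of this size.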

A simple application of the law of large numbers shows that the IV estimator is consistent for
$\beta_1$: $\mathrm{plim}(\hat{\beta}_1) = \beta_1$, provided assumptions (15.4) and (15.5) are satisfied. If either assumption fails,
the IV estimators are not consistent (more on this later). One feature of the IV estimator is that, when 
x and u are in fact correlated—so that instrumental variables estimation is actually needed—it is 
essentially never unbiased. This means that, in small samples, the IV estimator can have a substantial 
bias, which is one reason why large samples are preferred. 

When discussing the application of instrumental variables it is important to be careful with 
language. Like OLS, IV is an estimation method. It makes little sense to refer to “an instrumental 
variables model”—just as the phrase “OLS model” makes little sense. As we know, a model is an 
equation such as (15.8), which is a special case of the generic model in equation (15.2). When we 
have a model such as (15.2), we can choose to estimate the parameters of that model in many differ- 
ent ways. Prior to this chapter we focused primarily on OLS, but, for example, we also know from 
Chapter 8 that one can use weighted least squares as an alternative estimation method (and there are
unlimited possibilities for the weights). If we have an instrumental variable candidate z for x, then we
can instead apply instrumental variables estimation. It is certainly true that the estimation method we 
apply is motivated by the model and assumptions we make about that model. But the estimators are 
well defined and exist apart from any underlying model or assumptions: remember, an estimator is 
simply a rule for combining data. The bottom line is that while we probably know what a researcher 
means when using a phrase such as “I estimated an IV model,” such language betrays a lack of under- 
standing about the difference between a model and an estimation method. 


15-1a Statistical Inference with the IV Estimator 


Given the similar structure of the IV and OLS estimators, it is not surprising that the IV estimator 
has an approximate normal distribution in large sample sizes. To perform inference on $\beta_1$, we need a
standard error that can be used to compute t statistics and confidence intervals. The usual approach
is to impose a homoskedasticity assumption, just as in the case of OLS. Now, the homoskedasticity 
assumption is stated conditional on the instrumental variable, z, not the endogenous explanatory vari- 
able, x. Along with the previous assumptions on u, x, and z, we add 


$E(u^2|z) = \sigma^2 = \mathrm{Var}(u).$ [15.11]


It can be shown that, under (15.4), (15.5), and (15.11), the asymptotic variance of $\hat{\beta}_1$ is


$\dfrac{\sigma^2}{n \sigma_x^2 \rho_{x,z}^2},$ [15.12]

where $\sigma_x^2$ is the population variance of x, $\sigma^2$ is the population variance of u, and $\rho_{x,z}^2$ is the square of
the population correlation between x and z. This tells us how highly correlated x and z are in the popu- 
lation. As with the OLS estimator, the asymptotic variance of the IV estimator decreases to zero at the 
rate of 1/n, where n is the sample size. 

Equation (15.12) is interesting for two reasons. First, it provides a way to obtain a standard error 
for the IV estimator. All quantities in (15.12) can be consistently estimated given a random sample. To 
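estimate $\sigma_x^2$, we simply compute the sample variance of $x_i$; to estimate $\rho_{x,z}^2$, we can run the regression
of $x_i$ on $z_i$ to obtain the R-squared, say, $R_{x,z}^2$. Finally, to estimate $\sigma^2$, we can use the IV residuals,


$\hat{u}_i = y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i, \quad i = 1, 2, \ldots, n,$


where $\hat{\beta}_0$ and $\hat{\beta}_1$ are the IV estimates. A consistent estimator of $\sigma^2$ looks just like the estimator of $\sigma^2$
from a simple OLS regression:


$\hat{\sigma}^2 = \dfrac{1}{n-2} \sum_{i=1}^{n} \hat{u}_i^2,$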


where it is standard to use the degrees of freedom correction (even though this has little effect as the 
sample size grows). 

The (asymptotic) standard error of $\hat{\beta}_1$ is the square root of the estimated asymptotic variance, the
latter of which is given by


$\dfrac{\hat{\sigma}^2}{SST_x \cdot R_{x,z}^2},$ [15.13]


where $SST_x$ is the total sum of squares of the $x_i$. [Recall that the sample variance of $x_i$ is $SST_x/n$, and
so the sample sizes cancel to give us (15.13).] The resulting standard error can be used to construct
either t statistics for hypotheses involving $\beta_1$ or confidence intervals for $\beta_1$. $\hat{\beta}_0$ also has a standard
error that we do not present here. Any modern econometrics package computes the standard error
after any IV estimation; there is rarely any reason to perform the calculations by hand.

A second reason (15.12) is interesting is that it allows us to compare the asymptotic variances of
the IV and the OLS estimators (when x and u are uncorrelated). Under the Gauss-Markov assumptions,
the variance of the OLS estimator is $\sigma^2/SST_x$, while the comparable formula for the IV estimator is
$\sigma^2/(SST_x \cdot R_{x,z}^2)$; they differ only in that $R_{x,z}^2$ appears in the denominator of the IV variance. Because an
R-squared is always less than one, the IV variance is always larger than the OLS variance (when OLS
is valid). If $R_{x,z}^2$ is small, then the IV variance can be much larger than the OLS variance. Remember,
$R_{x,z}^2$ measures the strength of the linear relationship between x and z in the sample. If x and z are only
slightly correlated, $R_{x,z}^2$ can be small, and this can translate into a very large sampling variance for the
IV estimator. The more highly correlated z is with x, the closer $R_{x,z}^2$ is to one, and the smaller is the
variance of the IV estimator. In the case that z = x, $R_{x,z}^2 = 1$, and we get the OLS variance, as expected.

The previous discussion highlights an important cost of performing IV estimation when x and u 
are uncorrelated: the asymptotic variance of the IV estimator is always larger, and sometimes much 
larger, than the asymptotic variance of the OLS estimator. 
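
To make the variance comparison concrete, the following sketch computes the IV standard error from (15.13) and the usual OLS standard error on simulated data in which x is in fact exogenous, so that both estimators are consistent. The data-generating process and all names are assumptions for illustration; the degrees-of-freedom correction n - 2 is used in both variance estimates.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 2000

# Assumed design in which x is actually exogenous (u is independent of x and z),
# so both OLS and IV are consistent and their precision can be compared.
z = rng.normal(size=n)
x = 0.5 * z + rng.normal(size=n)
u = rng.normal(size=n)
y = 1.0 + 0.5 * x + u

xc, yc, zc = x - x.mean(), y - y.mean(), z - z.mean()
sst_x = (xc ** 2).sum()

# R-squared from the regression of x on z (the squared sample correlation).
r2_xz = (zc * xc).sum() ** 2 / ((zc ** 2).sum() * sst_x)

# IV estimates, IV residuals, and the standard error from (15.13).
b1_iv = (zc * yc).sum() / (zc * xc).sum()
b0_iv = y.mean() - b1_iv * x.mean()
u_iv = y - b0_iv - b1_iv * x
sig2_iv = (u_iv ** 2).sum() / (n - 2)
se_iv = np.sqrt(sig2_iv / (sst_x * r2_xz))

# OLS counterpart: estimated sigma^2 / SST_x using the OLS residuals.
b1_ols = (xc * yc).sum() / sst_x
u_ols = yc - b1_ols * xc
se_ols = np.sqrt((u_ols ** 2).sum() / (n - 2) / sst_x)

print(f"R2_xz={r2_xz:.3f}  se_iv={se_iv:.4f}  se_ols={se_ols:.4f}")
print(f"se_iv/se_ols={se_iv / se_ols:.2f}  1/sqrt(R2_xz)={1 / np.sqrt(r2_xz):.2f}")
```

When x is exogenous, the ratio of the two standard errors should be close to the reciprocal of the square root of the first-stage R-squared, which is the factor by which (15.13) inflates the OLS formula.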


Estimating the Return to Education for Married Women 


We use the data on married working women in MROZ to estimate the return to education in the 
simple regression model 


$\log(wage) = \beta_0 + \beta_1 educ + u.$ [15.14]

For comparison, we first obtain the OLS estimates:

$\widehat{\log(wage)} = -.185 + .109\, educ$
(.185) (.014) [15.15]
n = 428, $R^2$ = .118.


The estimate for $\beta_1$ implies an almost 11% return for another year of education.

Next, we use father’s education (fatheduc) as an instrumental variable for educ. We have to main- 
tain that fatheduc is uncorrelated with u. The second requirement is that educ and fatheduc are cor- 
related. We can check this very easily using a simple regression of educ on fatheduc (using only the 
working women in the sample): 


$\widehat{educ} = 10.24 + .269\, fatheduc$
(.28) (.029) [15.16]
n = 428, $R^2$ = .173.


The t statistic on fatheduc is 9.28, which indicates that educ and fatheduc have a statistically significant
positive correlation. (In fact, fatheduc explains about 17% of the variation in educ in the sample.)
Using fatheduc as an IV for educ gives


$\widehat{\log(wage)} = .441 + .059\, educ$
(.446) (.035) [15.17]
n = 428, $R^2$ = .093.


The IV estimate of the return to education is 5.9%, which is barely more than one-half of the OLS esti- 
mate. This suggests that the OLS estimate is too high and is consistent with omitted ability bias. But 
we should remember that these are estimates from just one sample: we can never know whether .109 
is above the true return to education, or whether .059 is closer to the true return to education. Further, 
the standard error of the IV estimate is two and one-half times as large as the OLS standard error (this 
is expected, for the reasons we gave earlier). The 95% confidence interval for $\beta_1$ using OLS is much
tighter than that using the IV; in fact, the IV confidence interval actually contains the OLS estimate. 
Therefore, although the differences between (15.15) and (15.17) are practically large, we cannot say 
whether the difference is statistically significant. We will show how to test this in Section 15-5. 
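
The confidence interval comparison can be checked directly from the reported estimates and standard errors in (15.15) and (15.17). The short calculation below assumes the normal critical value 1.96 as an approximation for the 95% intervals; the variable names are illustrative.

```python
# Point estimates and standard errors reported in (15.15) and (15.17).
b_ols, se_ols = 0.109, 0.014
b_iv, se_iv = 0.059, 0.035

crit = 1.96  # approximate 95% critical value from the standard normal

ci_ols = (b_ols - crit * se_ols, b_ols + crit * se_ols)
ci_iv = (b_iv - crit * se_iv, b_iv + crit * se_iv)

print(f"OLS 95% CI: [{ci_ols[0]:.3f}, {ci_ols[1]:.3f}]")
print(f"IV  95% CI: [{ci_iv[0]:.3f}, {ci_iv[1]:.3f}]")
print("IV interval contains the OLS estimate:", ci_iv[0] <= b_ols <= ci_iv[1])
print(f"Ratio of standard errors: {se_iv / se_ols:.2f}")
```

The intervals are roughly [.082, .136] for OLS and [-.010, .128] for IV, so the IV interval contains the OLS estimate of .109 and also includes zero. The ratio of the standard errors is 2.5.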

In the previous example, the estimated return to education using IV was less than that using
OLS, which corresponds to our expectations. But this need not have been the case, as the following 
example demonstrates. 


Estimating the Return to Education for Men 


We now use WAGE2 to estimate the return to education for men. We use the variable sibs (number of siblings)
as an instrument for educ. These are negatively correlated, as we can verify from a simple regression:


$\widehat{educ} = 14.14 - .228\, sibs$
(.11) (.030)
n = 935, $R^2$ = .057.


This equation implies that every sibling is associated with, on average, about .23 less of a year of edu- 
cation. If we assume that sibs is uncorrelated with the error term in (15.14), then the IV estimator is 
consistent. Estimating equation (15.14) using sibs as an IV for educ gives 


$\widehat{\log(wage)} = 5.13 + .122\, educ$
(.36) (.026)
n = 935.


(The R-squared is computed to be negative, so we do not report it. A discussion of R-squared in the 
context of IV estimation follows.) For comparison, the OLS estimate of $\beta_1$ is .059 with a standard error
of .006. Unlike in the previous example, the IV estimate is now much higher than the OLS estimate. 
Although we do not know whether the difference is statistically significant, this does not mesh with the 
omitted ability bias from OLS. It could be that sibs is also correlated with ability: more siblings means, on 
average, less parental attention, which could result in lower ability. Another interpretation is that the OLS 
estimator is biased toward zero because of measurement error in educ. This is not entirely convincing 
because, as we discussed in Section 9-3, educ is unlikely to satisfy the classical errors-in-variables model. 


In the previous examples, the endogenous explanatory variable (educ) and the instrumental variables 
(fatheduc, sibs) have quantitative meaning. But nothing prevents the explanatory variable or IV from 
being binary variables. Angrist and Krueger (1991), in their simplest analysis, came up with a clever 
binary instrumental variable for educ, using census data on men in the United States. Let frstqrt be equal 
to one if the man was born in the first quarter of the year, and zero otherwise. It seems that the error term 
in (15.14), and in particular ability, should be unrelated to quarter of birth. But frstqrt also needs to
be correlated with educ. It turns out that years of education do differ systematically in the population 
based on quarter of birth. Angrist and Krueger argued persuasively that this is due to compulsory school 
attendance laws in effect in all states. Briefly, students born early in the year typically begin school at an 
older age. Therefore, they reach the compulsory schooling age (16 in most states) with somewhat less 
education than students who begin school at a younger age. For students who finish high school, Angrist 
and Krueger verified that there is no relationship between years of education and quarter of birth. 

Because years of education varies only slightly across quarter of birth, which means $R^2_{x,z}$ in 
(15.13) is very small, Angrist and Krueger needed a very large sample size to get a reasonably precise 
IV estimate. Using 247,199 men born between 1920 and 1929, the OLS estimate of the return to 
education was .0801 (standard error .0004), and the IV estimate was .0715 (.0219); these are reported 
in Table III of Angrist and Krueger’s paper. Note how large the t statistic is for the OLS estimate (about 
200), whereas the t statistic for the IV estimate is only 3.26. Thus, the IV estimate is statistically 
different from zero, but its confidence interval is much wider than that based on the OLS estimate. 
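
The t statistics quoted above follow directly from the reported estimates and standard errors. The short sketch below redoes that arithmetic and adds approximate 95% confidence intervals; the critical value 1.96 and the helper name `t_stat_and_ci` are assumptions made for illustration, not something taken from Angrist and Krueger's paper.

```python
# Estimates and standard errors as quoted in the text (Angrist and Krueger, 1991, Table III).
estimates = {"OLS": (0.0801, 0.0004), "IV": (0.0715, 0.0219)}


def t_stat_and_ci(coef, se, crit=1.96):
    """Return the t statistic and an approximate 95% confidence interval.

    crit=1.96 is the usual large-sample normal critical value (an assumption here).
    """
    return coef / se, (coef - crit * se, coef + crit * se)


results = {name: t_stat_and_ci(b, se) for name, (b, se) in estimates.items()}

for name, (t, (lo, hi)) in results.items():
    print(f"{name}: t = {t:.2f}, approx. 95% CI = [{lo:.4f}, {hi:.4f}]")

# Width of each interval: the IV interval is far wider than the OLS interval.
widths = {name: ci[1] - ci[0] for name, (_, ci) in results.items()}
```

Both intervals exclude zero, but the IV interval is many times wider, which is the precision cost discussed in the text.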

An interesting finding by Angrist and Krueger is that the IV estimate does not differ much from 
the OLS estimate. In fact, using men born in the next decade, the IV estimate is somewhat higher 
than the OLS estimate. One could interpret this as showing that there is no omitted ability bias when 
wage equations are estimated by OLS. However, the Angrist and Krueger paper has been criticized on 
econometric grounds. As discussed by Bound, Jaeger, and Baker (1995), it is not obvious that season 
of birth is unrelated to unobserved factors that affect wage. As we will explain in the next subsection, 
even a small amount of correlation between z and u can cause serious problems for the IV estimator. 

For policy analysis, the endogenous explanatory variable is often a binary variable. For example, 
Angrist (1990) studied the effect that being a veteran of the Vietnam War had on lifetime earnings. 
A simple model is 


$$\log(\mathit{earns}) = \beta_0 + \beta_1\,\mathit{veteran} + u,$$ [15.18] 


where veteran is a binary variable. The problem with estimating this equation by OLS is that there 
may be a self-selection problem, as we mentioned in Chapter 7: perhaps people who get the most out 
of the military choose to join, or the decision to join is correlated with other characteristics that affect 
earnings. These will cause veteran and u to be correlated. 

Angrist pointed out that the Vietnam draft lottery provided a natural experiment (see also Chapter 13) 
that created an instrumental variable for veteran. Young men were given lottery numbers that 
determined whether they would be called to serve in Vietnam. Because the numbers given were 
(eventually) randomly assigned, it seems plausible that draft lottery number is uncorrelated with the 
error term u. But those with a low enough number had to serve in Vietnam, so that the probability of 
being a veteran is correlated with lottery number. If both of these assertions are true, draft lottery 
number is a good IV candidate for veteran. 

GOING FURTHER 15.1 

If some men who were assigned low draft lottery numbers obtained additional schooling to reduce the 
probability of being drafted, is lottery number a good instrument for veteran in (15.18)? 

It is also possible to have a binary endogenous explanatory variable and a binary instrumental 
variable. See Problem 1 for an example. 


15-1b Properties of IV with a Poor Instrumental Variable 


We have already seen that, though IV is consistent when z and u are uncorrelated and z and x have any 
positive or negative correlation, IV estimates can have large standard errors, especially if z and x are 
only weakly correlated. Weak correlation between z and x can have even more serious consequences: 
the IV estimator can have a large asymptotic bias even if z and u are only moderately correlated. 

We can see this by studying the probability limit of the IV estimator when z and u are possibly 
correlated. Letting $\hat{\beta}_{1,IV}$ denote the IV estimator, we can write 

$$\operatorname{plim} \hat{\beta}_{1,IV} = \beta_1 + \frac{\mathrm{Corr}(z,u)}{\mathrm{Corr}(z,x)} \cdot \frac{\sigma_u}{\sigma_x},$$ [15.19] 

where $\sigma_u$ and $\sigma_x$ are the standard deviations of u and x in the population, respectively. The 
interesting part of this equation involves the correlation terms. It shows that, even if Corr(z,u) is small, the 
inconsistency in the IV estimator can be very large if Corr(z,x) is also small. Thus, even if we focus 
only on consistency, it is not necessarily better to use IV than OLS if the correlation between z and 
u is smaller than that between x and u. Using the fact that $\mathrm{Corr}(x,u) = \mathrm{Cov}(x,u)/(\sigma_x\sigma_u)$ along with 
equation (5.3), we can write the plim of the OLS estimator (call it $\hat{\beta}_{1,OLS}$) as 

$$\operatorname{plim} \hat{\beta}_{1,OLS} = \beta_1 + \mathrm{Corr}(x,u) \cdot \frac{\sigma_u}{\sigma_x}.$$ [15.20] 

Comparing these formulas shows that it is possible for the directions of the asymptotic biases to be 
different for IV and OLS. For example, suppose Corr(x,u) > 0, Corr(z,x) > 0, and Corr(z,u) < 0. 
Then the IV estimator has a downward bias, whereas the OLS estimator has an upward bias 
(asymptotically). In practice, this situation is probably rare. More problematic is when the direction 
of the bias is the same and the correlation between z and x is small. For concreteness, suppose x and 
z are both positively correlated with u and Corr(z,x) > 0. Then the asymptotic bias in the IV 
estimator is less than that for OLS only if Corr(z,u)/Corr(z,x) < Corr(x,u). If Corr(z,x) is small, then a 
seemingly small correlation between z and u can be magnified and make IV worse than OLS, even if 
we restrict attention to bias. For example, if Corr(z,x) = .2, Corr(z,u) must be less than one-fifth of 
Corr(x,u) before IV has less asymptotic bias than OLS. In many applications, the correlation between 
the instrument and x is less than .2. Unfortunately, because we rarely have an idea about the relative 
magnitudes of Corr(z,u) and Corr(x,u), we can never know for sure which estimator has the larger 
asymptotic bias [unless, of course, we assume Corr(z,u) = 0]. 
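
The comparison between (15.19) and (15.20) can be made concrete with a few lines of arithmetic. The sketch below evaluates both bias terms for illustrative correlations; the function names and the specific values (Corr(x,u) = .3, and the three trial values of Corr(z,u)) are hypothetical choices, and the standard deviations are set to one purely for convenience.

```python
def iv_asymptotic_bias(corr_zu, corr_zx, sd_u=1.0, sd_x=1.0):
    """Bias term in (15.19): [Corr(z,u)/Corr(z,x)] * (sd_u/sd_x)."""
    if corr_zx == 0:
        raise ValueError("Corr(z,x) = 0: the instrument is irrelevant, (15.19) is undefined")
    return (corr_zu / corr_zx) * (sd_u / sd_x)


def ols_asymptotic_bias(corr_xu, sd_u=1.0, sd_x=1.0):
    """Bias term in (15.20): Corr(x,u) * (sd_u/sd_x)."""
    return corr_xu * (sd_u / sd_x)


corr_zx = 0.2   # weak instrument, as in the text's example
corr_xu = 0.3   # illustrative degree of endogeneity (assumed)

# IV has the smaller asymptotic bias only if Corr(z,u) < Corr(z,x) * Corr(x,u).
threshold = corr_zx * corr_xu

bias_ols = ols_asymptotic_bias(corr_xu)
bias_iv_small = iv_asymptotic_bias(0.03, corr_zx)   # Corr(z,u) below the threshold
bias_iv_large = iv_asymptotic_bias(0.10, corr_zx)   # Corr(z,u) above the threshold

for corr_zu in (0.03, 0.06, 0.10):
    print(f"Corr(z,u) = {corr_zu:.2f}: IV bias = {iv_asymptotic_bias(corr_zu, corr_zx):.3f}, "
          f"OLS bias = {bias_ols:.3f}")
```

With Corr(z,x) = .2, a correlation between z and u of only .10 already gives IV a larger asymptotic bias than OLS in this illustration.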

In the Angrist and Krueger (1991) example mentioned earlier, where x is years of schooling and 
z is a binary variable indicating quarter of birth, the correlation between z and x is very small. Bound, 
Jaeger, and Baker (1995) discussed reasons why quarter of birth and u might be somewhat correlated. 
From equation (15.19), we see that this can lead to a substantial bias in the IV estimator. 
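
A small simulation can illustrate how (15.19) plays out with a weak, slightly invalid instrument. Everything in the sketch below is an assumption made for illustration (the data-generating process, the coefficients .05 and .10, the seed, and the sample size); it is not the Angrist and Krueger data. The sample analogue of (15.19) holds as an exact algebraic identity, which the code checks.

```python
import numpy as np

rng = np.random.default_rng(12345)
n = 400_000
beta0, beta1 = 2.0, 1.0            # hypothetical true parameters

z = rng.normal(size=n)             # instrument
e = rng.normal(size=n)             # shared shock that makes x endogenous
w = rng.normal(size=n)
u = 0.05 * z + e                   # error term: slightly correlated with z
x = 0.10 * z + 0.5 * e + w         # regressor: only weakly related to z
y = beta0 + beta1 * x + u


def cov(a, b):
    return np.cov(a, b)[0, 1]


def corr(a, b):
    return np.corrcoef(a, b)[0, 1]


beta1_iv = cov(z, y) / cov(z, x)   # simple IV estimator
beta1_ols = cov(x, y) / cov(x, x)  # simple OLS slope

# Sample versions of the bias terms in (15.19) and (15.20).
sd_ratio = np.std(u, ddof=1) / np.std(x, ddof=1)
iv_bias_formula = corr(z, u) / corr(z, x) * sd_ratio
ols_bias_formula = corr(x, u) * sd_ratio

print("IV  estimate:", beta1_iv, " bias from formula:", iv_bias_formula)
print("OLS estimate:", beta1_ols, " bias from formula:", ols_bias_formula)
```

In this design the population value of Corr(z,u) is small, yet because Corr(z,x) is also small the IV estimator ends up further from the true slope than OLS.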

When z and x are not correlated at all, things are especially bad, whether or not z is uncorre- 
lated with u. The following example illustrates why we should always check to see if the endogenous 
explanatory variable is correlated with the IV candidate. 


Estimating the Effect of Smoking on Birth Weight 


In Chapter 6, we estimated the effect of cigarette smoking on child birth weight. Without other 
explanatory variables, the model is 


$$\log(\mathit{bwght}) = \beta_0 + \beta_1\,\mathit{packs} + u,$$ [15.21] 


where packs is the number of packs smoked by the mother per day. We might worry that packs is cor- 
related with other health factors or the availability of good prenatal care, so that packs and u might be 
correlated. A possible instrumental variable for packs is the average price of cigarettes in the state of 
residence, cigprice. We will assume that cigprice and u are uncorrelated (even though state support 
for health care could be correlated with cigarette taxes). 

If cigarettes are a typical consumption good, basic economic theory suggests that packs and cig- 
price are negatively correlated, so that cigprice can be used as an IV for packs. To check this, we 
regress packs on cigprice, using the data in BWGHT: 


$$\widehat{\mathit{packs}} = \underset{(.103)}{.067} + \underset{(.0008)}{.0003}\,\mathit{cigprice}$$
$$n = 1{,}388,\quad R^2 = .0000,\quad \bar{R}^2 = -.0006.$$

This indicates no relationship between smoking during pregnancy and cigarette prices, which is 
perhaps not too surprising given the addictive nature of cigarette smoking. 


Because packs and cigprice are not correlated, we should not use cigprice as an IV for packs in 
(15.21). But what happens if we do? The IV results would be 


$$\widehat{\log(\mathit{bwght})} = \underset{(.91)}{4.45} + \underset{(8.70)}{2.99}\,\mathit{packs}$$
$$n = 1{,}388$$

(the reported R-squared is negative). The coefficient on packs is huge and of an unexpected sign. The 
standard error is also very large, so packs is not significant. But the estimates are meaningless because 
cigprice fails the one requirement of an IV that we can always test: assumption (15.5). 

The previous example shows that IV estimation can produce strange results when the instrument 
relevance condition, $\mathrm{Corr}(z,x) \neq 0$, fails. Of practically greater interest is the so-called problem 
of weak instruments, which is loosely defined as the problem of “low” (but not zero) correlation 
between z and x. In a particular application, it is difficult to define how low is too low, but recent 
theoretical research, supplemented by simulation studies, has shed considerable light on the issue. 
Staiger and Stock (1997) formalized the problem of weak instruments by modeling the correlation 
between z and x as a function of the sample size; in particular, the correlation is assumed to shrink to 
zero at the rate $1/\sqrt{n}$. Not surprisingly, the asymptotic distribution of the instrumental variables 
estimator is different compared with the usual asymptotics, where the correlation is assumed to be fixed 
and nonzero. One of the implications of the Staiger-Stock work is that the usual statistical inference, 
based on t statistics and the standard normal distribution, can be seriously misleading. We discuss this 
further in Section 15-3. 


15-1c Computing R-Squared after IV Estimation 


Most regression packages compute an R-squared after IV estimation, using the standard formula: 
$R^2 = 1 - \mathrm{SSR}/\mathrm{SST}$, where SSR is the sum of squared IV residuals and SST is the total sum of 
squares of y. Unlike in the case of OLS, the R-squared from IV estimation can be negative because 
SSR for IV can actually be larger than SST. Although it does not really hurt to report the R-squared 
for IV estimation, it is not very useful, either. When x and u are correlated, we cannot decompose the 
variance of y into $\beta_1^2\mathrm{Var}(x) + \mathrm{Var}(u)$, and so the R-squared has no natural interpretation. In addition, 
as we will discuss in Section 15-3, these R-squareds cannot be used in the usual way to compute 
F tests of joint restrictions. 

If our goal was to produce the largest R-squared, we would always use OLS. IV methods are 
intended to provide better estimates of the ceteris paribus effect of x on y when x and u are correlated; 
goodness-of-fit is not a factor. A high R-squared resulting from OLS is of little comfort if we cannot 
consistently estimate $\beta_1$. 
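
To see how the formula $R^2 = 1 - \mathrm{SSR}/\mathrm{SST}$ can turn negative after IV estimation, consider the simulated sketch below. The data-generating process is entirely hypothetical and is chosen so that x and u are strongly negatively correlated, which makes the variance of y smaller than the variance of u; the seed and sample size are likewise arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5_000

z = rng.normal(size=n)             # instrument, unrelated to u by construction
v = rng.normal(size=n)
e = rng.normal(size=n)
x = 0.5 * z + v                    # endogenous regressor
u = -v + 0.3 * e                   # error: strongly negatively correlated with x
y = 1.0 + 1.0 * x + u              # true slope is 1 (assumed for the illustration)


def r_squared(y, fitted):
    """Standard formula 1 - SSR/SST; nothing forces this to be nonnegative."""
    ssr = np.sum((y - fitted) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    return 1.0 - ssr / sst


# OLS slope and intercept
b1_ols = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
b0_ols = y.mean() - b1_ols * x.mean()

# Simple IV slope and intercept
b1_iv = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]
b0_iv = y.mean() - b1_iv * x.mean()

r2_ols = r_squared(y, b0_ols + b1_ols * x)
r2_iv = r_squared(y, b0_iv + b1_iv * x)

print(f"OLS: slope = {b1_ols:.3f}, R-squared = {r2_ols:.3f}")
print(f"IV:  slope = {b1_iv:.3f}, R-squared = {r2_iv:.3f}")
```

In this design the IV slope is close to the true value while OLS is badly biased, yet OLS has the higher R-squared, which underlines that goodness-of-fit is not the criterion for choosing between them.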


15-2 IV Estimation of the Multiple Regression Model 


The IV estimator for the simple regression model is easily extended to the multiple regression case. 
We begin with the case where only one of the explanatory variables is correlated with the error. In 
fact, consider a standard linear model with two explanatory variables: 


$$y_1 = \beta_0 + \beta_1 y_2 + \beta_2 z_1 + u_1.$$ [15.22] 


We call this a structural equation to emphasize that we are interested in the $\beta_j$, which simply means 
that the equation is supposed to measure a causal relationship. We use a new notation here to 
distinguish endogenous from exogenous variables. The dependent variable $y_1$ is clearly endogenous, 
as it is correlated with $u_1$. The variables $y_2$ and $z_1$ are the explanatory variables, and $u_1$ is the error. 
As usual, we assume that the expected value of $u_1$ is zero: $E(u_1) = 0$. We use $z_1$ to indicate that this 
variable is exogenous in (15.22) ($z_1$ is uncorrelated with $u_1$). We use $y_2$ to indicate that this variable 
is suspected of being correlated with $u_1$. We do not specify why $y_2$ and $u_1$ are correlated, but for now 
it is best to think of $u_1$ as containing an omitted variable correlated with $y_2$. The notation in equation 
(15.22) originates in simultaneous equations models (which we cover in Chapter 16), but we use it 
more generally to easily distinguish exogenous from endogenous explanatory variables in a multiple 
regression model. 
An example of (15.22) is 

$$\log(\mathit{wage}) = \beta_0 + \beta_1\,\mathit{educ} + \beta_2\,\mathit{exper} + u_1,$$ [15.23] 

where $y_1 = \log(\mathit{wage})$, $y_2 = \mathit{educ}$, and $z_1 = \mathit{exper}$. In other words, we assume that exper is 
exogenous in (15.23), but we allow that educ (for the usual reasons) is correlated with $u_1$. 

We know that if (15.22) is estimated by OLS, all of the estimators will be biased and inconsistent. 
Thus, we follow the strategy suggested in the previous section and seek an instrumental variable 
for $y_2$. Because $z_1$ is assumed to be uncorrelated with $u_1$, can we use $z_1$ as an instrument for $y_2$, 
assuming $y_2$ and $z_1$ are correlated? The answer is no. Because $z_1$ itself appears as an explanatory variable in 
(15.22), it cannot serve as an instrumental variable for $y_2$. We need another exogenous variable, call 
it $z_2$, that does not appear in (15.22). Therefore, key assumptions are that $z_1$ and $z_2$ are uncorrelated 
with $u_1$; we also assume that $u_1$ has zero expected value, which is without loss of generality when the 
equation contains an intercept: 

$$E(u_1) = 0,\quad \mathrm{Cov}(z_1,u_1) = 0,\quad \text{and}\quad \mathrm{Cov}(z_2,u_1) = 0.$$ [15.24] 


Given the zero mean assumption, the latter two assumptions are equivalent to $E(z_1u_1) = E(z_2u_1) = 0$, 
and so the method of moments approach suggests obtaining estimators $\hat{\beta}_0$, $\hat{\beta}_1$, and $\hat{\beta}_2$ by solving the 
sample counterparts of (15.24): 

$$\sum_{i=1}^{n} (y_{i1} - \hat{\beta}_0 - \hat{\beta}_1 y_{i2} - \hat{\beta}_2 z_{i1}) = 0$$
$$\sum_{i=1}^{n} z_{i1}(y_{i1} - \hat{\beta}_0 - \hat{\beta}_1 y_{i2} - \hat{\beta}_2 z_{i1}) = 0$$ [15.25] 
$$\sum_{i=1}^{n} z_{i2}(y_{i1} - \hat{\beta}_0 - \hat{\beta}_1 y_{i2} - \hat{\beta}_2 z_{i1}) = 0.$$

This is a set of three linear equations in the three unknowns $\hat{\beta}_0$, $\hat{\beta}_1$, and $\hat{\beta}_2$, and it is easily solved 
given the data on $y_1$, $y_2$, $z_1$, and $z_2$. The estimators are called instrumental variables estimators. If we 
think $y_2$ is exogenous and we choose $z_2 = y_2$, equations (15.25) are exactly the first order conditions 
for the OLS estimators; see equations (3.13). 
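
In matrix form, the three moment conditions in (15.25) say that the instrument matrix is orthogonal to the residuals, so the estimates solve a 3-by-3 linear system. The sketch below does this with NumPy on simulated data; the function name `iv_just_identified`, the simulated coefficients, and the seed are all assumptions made for illustration.

```python
import numpy as np


def iv_just_identified(y1, X, Z):
    """Solve the sample moment conditions Z'(y1 - X b) = 0 for b.

    X holds the structural regressors and Z the instruments; both include a
    constant and must have the same number of columns (just-identified case).
    """
    return np.linalg.solve(Z.T @ X, Z.T @ y1)


rng = np.random.default_rng(0)
n = 2_000
z1, z2, c = rng.normal(size=(3, n))            # c is an unobserved common factor
y2 = 1.0 + 0.5 * z1 + 0.8 * z2 + c + rng.normal(size=n)
u1 = c + rng.normal(size=n)                    # correlated with y2 through c
y1 = 1.0 + 0.5 * y2 - 0.3 * z1 + u1            # hypothetical structural equation

ones = np.ones(n)
X = np.column_stack([ones, y2, z1])            # regressors in (15.22)
Z = np.column_stack([ones, z1, z2])            # z1 instruments itself, z2 instruments y2

b_iv = iv_just_identified(y1, X, Z)
moments = Z.T @ (y1 - X @ b_iv)                # the three sums in (15.25)

# Choosing the instruments equal to the regressors reproduces OLS.
b_ols_moments = iv_just_identified(y1, X, X)
b_ols = np.linalg.lstsq(X, y1, rcond=None)[0]

print("IV estimates: ", b_iv)
print("OLS estimates:", b_ols)
```

The last two lines of the computation mirror the remark in the text: with $z_2$ replaced by $y_2$, the moment conditions are the OLS first order conditions.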

We still need the instrumental variable $z_2$ to be correlated with $y_2$, but the sense in which these 
two variables must be correlated is complicated by the presence of $z_1$ in equation (15.22). We now 
need to state the assumption in terms of partial correlation. The easiest way to state the condition is to 
write the endogenous explanatory variable as a linear function of the exogenous variables and an error term: 

$$y_2 = \pi_0 + \pi_1 z_1 + \pi_2 z_2 + v_2,$$ [15.26] 

where, by construction, $E(v_2) = 0$, $\mathrm{Cov}(z_1,v_2) = 0$, and $\mathrm{Cov}(z_2,v_2) = 0$, and the $\pi_j$ are unknown 
parameters. The key identification condition [along with (15.24)] is that 

$$\pi_2 \neq 0.$$ [15.27] 

In other words, after partialling out $z_1$, $y_2$ and $z_2$ are still correlated. This correlation can be positive or 
negative, but it cannot be zero. Testing (15.27) is easy: we estimate (15.26) by OLS and use a t test 
(possibly making it robust to heteroskedasticity). We should always test this assumption. Unfortunately, 
we cannot test that $z_1$ and $z_2$ are uncorrelated with $u_1$; hopefully, we can make the case based on economic 
reasoning or introspection. 

GOING FURTHER 15.2 

Suppose we wish to estimate the effect of marijuana usage on college grade point average. For the 
population of college seniors at a university, let daysused denote the number of days in the past month on 
which a student smoked marijuana and consider the structural equation 

$$\mathit{colGPA} = \beta_0 + \beta_1\,\mathit{daysused} + \beta_2\,\mathit{SAT} + u.$$

(i) Let percHS denote the percentage of a student's high school graduating class that reported regular use 
of marijuana. If this is an IV candidate for daysused, write the reduced form for daysused. Do you think 
(15.27) is likely to be true? 

(ii) Do you think percHS is truly exogenous in the structural equation? What problems might there be? 

Equation (15.26) is an example of a reduced form equation, which means that we have written 
an endogenous variable in terms of exogenous variables. This name comes from simultaneous 
equations models (which we study in Chapter 16), but it is a useful concept whenever we have an 
endogenous explanatory variable. The name helps distinguish it from the structural equation (15.22). 
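
Checking condition (15.27) amounts to an ordinary t test in the reduced form regression. The sketch below estimates (15.26) by OLS on simulated data and computes the usual (non-robust) t statistic for the coefficient on $z_2$; the helper `ols_with_se`, the simulated coefficients, and the seed are assumptions for illustration, and a heteroskedasticity-robust version would need a different variance formula.

```python
import numpy as np


def ols_with_se(y, X):
    """OLS coefficients and the usual homoskedasticity-based standard errors."""
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ (X.T @ y)
    resid = y - X @ b
    sigma2 = resid @ resid / (n - k)          # degrees-of-freedom adjusted
    se = np.sqrt(sigma2 * np.diag(XtX_inv))
    return b, se


rng = np.random.default_rng(1)
n = 1_000
z1 = rng.normal(size=n)                        # exogenous variable in the structural equation
z2 = rng.normal(size=n)                        # candidate instrument, excluded from it
y2 = 1.0 + 0.5 * z1 + 0.3 * z2 + rng.normal(size=n)   # hypothetical reduced form

X = np.column_stack([np.ones(n), z1, z2])
pi_hat, se = ols_with_se(y2, X)

t_pi2 = pi_hat[2] / se[2]                      # t statistic for H0: pi_2 = 0
print(f"pi_2 estimate = {pi_hat[2]:.3f}, se = {se[2]:.3f}, t = {t_pi2:.2f}")
```

A large t statistic on $z_2$ supports instrument relevance; it says nothing about exogeneity, which cannot be tested this way.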

Adding more exogenous explanatory variables to the model is straightforward. Write the struc- 
tural model as 


$$y_1 = \beta_0 + \beta_1 y_2 + \beta_2 z_1 + \cdots + \beta_k z_{k-1} + u_1,$$ [15.28] 


where $y_2$ is thought to be correlated with $u_1$. Let $z_k$ be a variable not in (15.28) that is also exogenous. 
Therefore, we assume that 


$$E(u_1) = 0,\quad \mathrm{Cov}(z_j,u_1) = 0,\quad j = 1, \ldots, k.$$ [15.29] 


Under (15.29), $z_1, \ldots, z_{k-1}$ are the exogenous variables appearing in (15.28). In effect, these act as 
their own instrumental variables in estimating the $\beta_j$ in (15.28). The special case of k = 2 is given in 
the equations in (15.25); along with $z_2$, $z_1$ appears in the set of moment conditions used to obtain the 
IV estimates. More generally, $z_1, \ldots, z_{k-1}$ are used in the moment conditions along with the 
instrumental variable for $y_2$, $z_k$. 

The reduced form for $y_2$ is 

$$y_2 = \pi_0 + \pi_1 z_1 + \cdots + \pi_{k-1} z_{k-1} + \pi_k z_k + v_2,$$ [15.30] 

and we need some partial correlation between $z_k$ and $y_2$: 

$$\pi_k \neq 0.$$ [15.31] 

Under (15.29) and (15.31), $z_k$ is a valid IV for $y_2$. [We do not care about the remaining $\pi_j$ in (15.30); 
some or all of them could be zero.] A minor additional assumption is that there are no perfect linear 
relationships among the exogenous variables; this is analogous to the assumption of no perfect 
collinearity in the context of OLS. 

For standard statistical inference, we need to assume homoskedasticity of $u_1$. We give a careful 
statement of these assumptions in a more general setting in Section 15-3. 


Using College Proximity as an IV for Education 


Card (1995) used wage and education data for a sample of men in 1976 to estimate the return to educa- 
tion. He used a dummy variable for whether someone grew up near a four-year college (nearc4) as an 
instrumental variable for education. In a log(wage) equation, he included other standard controls: expe- 
rience, a black dummy variable, dummy variables for living in an SMSA and living in the South, and 
a full set of regional dummy variables and an SMSA dummy for where the man was living in 1966. 
In order for nearc4 to be a valid instrument, it must be uncorrelated with the error term in the wage 
equation—we assume this—and it must be partially correlated with educ. To check the latter require- 
ment, we regress educ on nearc4 and all of the exogenous variables appearing in the equation. (That is, 
we estimate the reduced form for educ.) Using the data in CARD, we obtain, in condensed form, 


$$\widehat{\mathit{educ}} = \underset{(.24)}{16.64} + \underset{(.088)}{.320}\,\mathit{nearc4} - \underset{(.034)}{.413}\,\mathit{exper} + \cdots$$
$$n = 3{,}010,\quad R^2 = .477.$$

We are interested in the coefficient and t statistic on nearc4. The coefficient implies that in 1976, 
other things being fixed (experience, race, region, and so on), people who lived near a college in 1966 
had, on average, about one-third of a year more education than those who did not grow up near a 
college. The t statistic on nearc4 is 3.64, which gives a p-value that is zero in the first three decimals. 
Therefore, if nearc4 is uncorrelated with unobserved factors in the error term, we can use nearc4 as 
an IV for educ. 

The OLS and IV estimates are given in Table 15.1. Like the OLS standard errors, the reported 
IV standard errors employ a degrees-of-freedom adjustment in estimating the error variance. In some 
statistical packages the degrees-of-freedom adjustment is the default; in others it is not. 

Interestingly, the IV estimate of the return to education is almost twice as large as the OLS esti- 
mate, but the standard error of the IV estimate is over 18 times larger than the OLS standard error. 
The 95% confidence interval for the IV estimate is between .024 and .239, which is a very wide 
range. The presence of larger confidence intervals is a price we must pay to get a consistent estimator 
of the return to education when we think educ is endogenous. 


TABLE 15.1 Dependent Variable: log(wage) 


Explanatory Variables    OLS         IV 
educ                     .075        .132 
                         (.003)      (.055) 
exper                    .085        .108 
                         (.007)      (.024) 
exper^2                  -.0023      -.0023 
                         (.0003)     (.0003) 
black                    -.199       -.147 
                         (.018)      (.054) 
smsa                     .136        .112 
                         (.020)      (.032) 
south                    -.148       -.145 
                         (.026)      (.027) 
Observations             3,010       3,010 
R-squared                .300        .238 
Other controls: smsa66, reg662, ..., reg669 


As discussed earlier, we should not make anything of the smaller R-squared in the IV estimation: 
by definition, the OLS R-squared will always be larger because OLS minimizes the sum of squared 
residuals. 


It is worth noting, especially for studying the effects of policy interventions, that a reduced form 
equation exists for $y_1$, too. In the context of equation (15.28) with $z_k$ an IV for $y_2$, the reduced form for 
$y_1$ always has the form 


$$y_1 = \gamma_0 + \gamma_1 z_1 + \cdots + \gamma_k z_k + e_1,$$ [15.32] 


where $\gamma_j = \beta_{j+1} + \beta_1\pi_j$ for $1 \le j < k$, $\gamma_k = \beta_1\pi_k$, and $e_1 = u_1 + \beta_1 v_2$, as can be verified by plugging 
(15.30) into (15.28) and rearranging. (The index $j+1$ appears because $z_j$ carries the coefficient $\beta_{j+1}$ in 
(15.28).) Because the $z_j$ are exogenous in (15.32), the $\gamma_j$ can be consistently estimated by OLS. In other 
words, we regress $y_1$ on all of the exogenous variables, including $z_k$, 
the IV for $y_2$. Only if we want to estimate $\beta_1$ in (15.28) do we need to apply IV. 
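
The reduced form for $y_1$ can be checked by direct substitution. The following is a sketch of that algebra, using only the structural equation (15.28) and the reduced form (15.30); because $z_j$ enters (15.28) with coefficient $\beta_{j+1}$, the slope on $z_j$ below involves $\beta_{j+1}$.

```latex
\begin{aligned}
y_1 &= \beta_0 + \beta_1 y_2 + \beta_2 z_1 + \cdots + \beta_k z_{k-1} + u_1 \\
    &= \beta_0 + \beta_1\left(\pi_0 + \pi_1 z_1 + \cdots + \pi_k z_k + v_2\right)
       + \beta_2 z_1 + \cdots + \beta_k z_{k-1} + u_1 \\
    &= (\beta_0 + \beta_1\pi_0)
       + \sum_{j=1}^{k-1} \left(\beta_{j+1} + \beta_1\pi_j\right) z_j
       + \beta_1\pi_k z_k + \left(u_1 + \beta_1 v_2\right).
\end{aligned}
```

Matching terms with (15.32) gives the intercept $\beta_0 + \beta_1\pi_0$, the coefficient $\beta_1\pi_k$ on the excluded instrument, and the composite error $u_1 + \beta_1 v_2$.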

When $y_2$ is a zero-one variable denoting participation and $z_k$ is a zero-one variable representing 
eligibility for program participation (which is, hopefully, either randomized across individuals or, at 
most, a function of the other exogenous variables $z_1, \ldots, z_{k-1}$, such as income), the coefficient $\gamma_k$ 
has an interesting interpretation. Rather than an estimate of the effect of the program itself, it is an 
estimate of the effect of offering the program. Unlike $\beta_1$ in (15.28), which measures the effect of 
the program itself, $\gamma_k$ accounts for the possibility that some units made eligible will choose not to 
participate. In the program evaluation literature, $\gamma_k$ is an example of an intention-to-treat parameter: 
it measures the effect of being made eligible and not the effect of actual participation. The 
intention-to-treat coefficient, $\gamma_k = \beta_1\pi_k$, depends on the effect of participating, $\beta_1$, and the change (typically, 
increase) in the probability of participating due to being eligible, $\pi_k$. [When $y_2$ is binary, equation 
(15.30) is a linear probability model, and therefore $\pi_k$ measures the ceteris paribus change in 
probability that $y_2 = 1$ as $z_k$ switches from zero to one.] 
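
The relationship between the intention-to-treat coefficient, the first-stage coefficient, and the effect of participation can be illustrated with a simulated program. In the sketch below there are no other exogenous variables, eligibility is randomized, and only eligible units can participate; the data-generating process, the true effect of 2, and the seed are hypothetical choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 10_000
beta1 = 2.0                                     # hypothetical effect of participating

z = rng.integers(0, 2, size=n)                  # randomized eligibility (0/1)
ability = rng.normal(size=n)                    # unobserved; drives self-selection
takes_up = (ability + rng.normal(size=n)) > 0   # would participate if made eligible
y2 = z * takes_up                               # participation: only the eligible can join
y1 = 1.0 + beta1 * y2 + ability + rng.normal(size=n)


def diff_in_means(v, group):
    """Difference in the mean of v between group == 1 and group == 0."""
    return v[group == 1].mean() - v[group == 0].mean()


itt = diff_in_means(y1, z)                      # reduced form: effect of being eligible
first_stage = diff_in_means(y2, z)              # change in participation probability
wald_iv = itt / first_stage                     # ratio recovers the participation effect

# The same number from the simple IV formula Cov(z, y1) / Cov(z, y2).
iv_cov = np.cov(z, y1)[0, 1] / np.cov(z, y2)[0, 1]

print(f"intention-to-treat = {itt:.3f}, first stage = {first_stage:.3f}, IV = {wald_iv:.3f}")
```

Because only about half of the eligible units participate in this design, the effect of offering the program is roughly half the effect of participating.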


15-3 Two Stage Least Squares 


In the previous section, we assumed that we had a single endogenous explanatory variable ($y_2$), along 
with one instrumental variable for $y_2$. It often happens that we have more than one exogenous variable 
that is excluded from the structural model and might be correlated with $y_2$, which means they are valid 
IVs for $y_2$. In this section, we discuss how to use multiple instrumental variables. 


15-3a A Single Endogenous Explanatory Variable 


Consider again the structural model (15.22), which has one endogenous and one exogenous 
explanatory variable. Suppose now that we have two exogenous variables excluded from (15.22): $z_2$ and $z_3$. 
Our assumptions that $z_2$ and $z_3$ do not appear in (15.22) and are uncorrelated with the error $u_1$ are 
known as exclusion restrictions. 

If $z_2$ and $z_3$ are both correlated with $y_2$, we could just use each as an IV, as in the previous 
section. But then we would have two IV estimators, and neither of these would, in general, be efficient. 
Since each of $z_1$, $z_2$, and $z_3$ is uncorrelated with $u_1$, any linear combination is also uncorrelated with $u_1$, 
and therefore any linear combination of the exogenous variables is a valid IV. To find the best IV, we 
choose the linear combination that is most highly correlated with $y_2$. This turns out to be given by the 
reduced form equation for $y_2$. Write 


$$y_2 = \pi_0 + \pi_1 z_1 + \pi_2 z_2 + \pi_3 z_3 + v_2,$$ [15.33] 

where 

$$E(v_2) = 0,\quad \mathrm{Cov}(z_1,v_2) = 0,\quad \mathrm{Cov}(z_2,v_2) = 0,\quad \text{and}\quad \mathrm{Cov}(z_3,v_2) = 0.$$

Then, the best IV for $y_2$ (under the assumptions given in the chapter appendix) is the linear 
combination of the $z_j$ in (15.33), which we call $y_2^*$: 

$$y_2^* = \pi_0 + \pi_1 z_1 + \pi_2 z_2 + \pi_3 z_3.$$ [15.34] 

For this IV not to be perfectly correlated with $z_1$ we need at least one of $\pi_2$ or $\pi_3$ to be different from zero: 

$$\pi_2 \neq 0 \ \text{ or } \ \pi_3 \neq 0.$$ [15.35] 


This is the key identification assumption, once we assume the $z_j$ are all exogenous. (The value of 
$\pi_1$ is irrelevant.) The structural equation (15.22) is not identified if $\pi_2 = 0$ and $\pi_3 = 0$. We can test 
$H_0$: $\pi_2 = 0$ and $\pi_3 = 0$ against (15.35) using an F statistic. 

A useful way to think of (15.33) is that it breaks $y_2$ into two pieces. The first is $y_2^*$; this is the part 
of $y_2$ that is uncorrelated with the error term, $u_1$. The second piece is $v_2$, and this part is possibly 
correlated with $u_1$, which is why $y_2$ is possibly endogenous. 

Given data on the $z_j$, we can compute $y_2^*$ for each observation, provided we know the population 
parameters $\pi_j$. This is never true in practice. Nevertheless, as we saw in the previous section, we can 
always estimate the reduced form by OLS. Thus, using the sample, we regress $y_2$ on $z_1$, $z_2$, and $z_3$ and 
obtain the fitted values: 


$$\hat{y}_2 = \hat{\pi}_0 + \hat{\pi}_1 z_1 + \hat{\pi}_2 z_2 + \hat{\pi}_3 z_3$$ [15.36] 


(that is, we have $\hat{y}_{i2}$ for each i). At this point, we should verify that $z_2$ and $z_3$ are jointly significant in 
(15.33) at a reasonably small significance level (no larger than 5%). If $z_2$ and $z_3$ are not jointly 
significant in (15.33), then we are wasting our time with IV estimation. 
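
The joint significance check described above is a standard F test comparing the first-stage regression with and without the excluded instruments. The sketch below computes that statistic from restricted and unrestricted sums of squared residuals on simulated data; the function names, coefficients, and seed are illustrative assumptions, and the version shown is the usual homoskedasticity-based F statistic.

```python
import numpy as np


def ssr(y, X):
    """Sum of squared OLS residuals from regressing y on X."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    r = y - X @ b
    return r @ r


def first_stage_F(y2, X_included, Z_excluded):
    """F statistic for H0: all coefficients on the excluded instruments are zero."""
    X_ur = np.column_stack([X_included, Z_excluded])
    q = Z_excluded.shape[1]                     # number of restrictions
    df = len(y2) - X_ur.shape[1]                # unrestricted degrees of freedom
    ssr_r, ssr_ur = ssr(y2, X_included), ssr(y2, X_ur)
    return ((ssr_r - ssr_ur) / q) / (ssr_ur / df)


rng = np.random.default_rng(3)
n = 1_000
z1, z2, z3, noise = rng.normal(size=(4, n))
X_inc = np.column_stack([np.ones(n), z1])       # constant and included exogenous variable
Z_exc = np.column_stack([z2, z3])               # excluded instruments

y2_relevant = 1.0 + 0.5 * z1 + 0.3 * z2 + 0.3 * z3 + noise   # instruments matter
y2_irrelevant = 1.0 + 0.5 * z1 + noise                        # instruments do not matter

F_relevant = first_stage_F(y2_relevant, X_inc, Z_exc)
F_irrelevant = first_stage_F(y2_irrelevant, X_inc, Z_exc)
print(f"F with relevant instruments:   {F_relevant:.2f}")
print(f"F with irrelevant instruments: {F_irrelevant:.2f}")
```

The second design corresponds to the case where IV estimation would be a waste of time, because neither excluded variable helps predict $y_2$.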

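Once we have $\hat{y}_2$, we can use it as the IV for $y_2$. The three equations for estimating $\beta_0$, $\beta_1$, and $\beta_2$ 
are the first two equations of (15.25), with the third replaced by 


$$\sum_{i=1}^{n} \hat{y}_{i2}(y_{i1} - \hat{\beta}_0 - \hat{\beta}_1 y_{i2} - \hat{\beta}_2 z_{i1}) = 0.$$ [15.37] 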


Solving the three equations in three unknowns gives us the IV estimators. 

With multiple instruments, the IV estimator using $\hat{y}_{i2}$ as the instrument is also called the 
two stage least squares (2SLS) estimator. The reason is simple. Using the algebra of OLS, it can be 
shown that when we use $\hat{y}_2$ as the IV for $y_2$, the IV estimates $\hat{\beta}_0$, $\hat{\beta}_1$, and $\hat{\beta}_2$ are identical to the OLS 
estimates from the regression of 


$$y_1 \ \text{on} \ \hat{y}_2 \ \text{and} \ z_1.$$ [15.38] 


In other words, we can obtain the 2SLS estimator in two stages. The first stage is to run the 
regression in (15.36), where we obtain the fitted values $\hat{y}_2$. The second stage is the OLS regression 
(15.38). Because we use $\hat{y}_2$ in place of $y_2$, the 2SLS estimates can differ substantially from the OLS 
estimates. 

Some economists like to interpret the regression in (15.38) as follows. The fitted value, $\hat{y}_2$, is the 
estimated version of $y_2^*$, and $y_2^*$ is uncorrelated with $u_1$. Therefore, 2SLS first “purges” $y_2$ of its 
correlation with $u_1$ before doing the OLS regression in (15.38). We can show this by plugging $y_2 = y_2^* + v_2$ 
into (15.22): 


$$y_1 = \beta_0 + \beta_1 y_2^* + \beta_2 z_1 + u_1 + \beta_1 v_2.$$ [15.39] 


Now, the composite error $u_1 + \beta_1 v_2$ has zero mean and is uncorrelated with $y_2^*$ and $z_1$, which is why 
the OLS regression in (15.38) works. 

Most econometrics packages have special commands for 2SLS, so there is no need to perform 
the two stages explicitly. In fact, in most cases you should avoid doing the second stage manually, as 
the standard errors and test statistics obtained in this way are not valid. [The reason is that the error 
term in (15.39) includes $v_2$, but the standard errors involve the variance of $u_1$ only.] Any regression 
software that supports 2SLS asks for the dependent variable, the list of explanatory variables (both 
exogenous and endogenous), and the entire list of instrumental variables (that is, all exogenous vari- 
ables). The output is typically quite similar to that for OLS. 
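
The two-stage logic, and the warning about standard errors from a manual second stage, can both be seen in a short NumPy sketch. The code below is an illustration under assumed simulated data, not a substitute for a package's 2SLS command: the function name `tsls`, the coefficients, and the seed are hypothetical, and the standard errors shown are the simple homoskedasticity-based ones.

```python
import numpy as np


def tsls(y1, X, Z):
    """2SLS with X = structural regressors and Z = all exogenous variables.

    Returns coefficients, standard errors based on the structural residuals
    y1 - X b, and the (invalid) standard errors a manual second stage reports.
    """
    n, k = X.shape
    # First stage: fitted values of every column of X from a regression on Z.
    X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]
    # Second stage: OLS of y1 on the fitted regressors.
    b = np.linalg.lstsq(X_hat, y1, rcond=None)[0]
    XtX_inv = np.linalg.inv(X_hat.T @ X_hat)
    resid = y1 - X @ b                 # uses the actual y2, estimating u1
    resid_naive = y1 - X_hat @ b       # what a manual second stage would use
    se = np.sqrt(resid @ resid / (n - k) * np.diag(XtX_inv))
    se_naive = np.sqrt(resid_naive @ resid_naive / (n - k) * np.diag(XtX_inv))
    return b, se, se_naive, X_hat


rng = np.random.default_rng(4)
n = 5_000
z1, z2, z3, c = rng.normal(size=(4, n))         # c is an unobserved common factor
y2 = 1.0 + 0.5 * z1 + 0.4 * z2 + 0.4 * z3 + c + rng.normal(size=n)
u1 = c + rng.normal(size=n)
y1 = 1.0 + 0.5 * y2 - 0.3 * z1 + u1             # hypothetical structural equation

ones = np.ones(n)
X = np.column_stack([ones, y2, z1])
Z = np.column_stack([ones, z1, z2, z3])

b, se, se_naive, X_hat = tsls(y1, X, Z)
# Same estimates from the IV moment conditions with the fitted values as instruments.
b_iv = np.linalg.solve(X_hat.T @ X, X_hat.T @ y1)

print("2SLS coefficients:        ", b)
print("valid standard errors:    ", se)
print("manual second-stage s.e.: ", se_naive)
```

The coefficients from the two routes agree, but the standard errors from the manual second stage differ because they are built from residuals that include the first-stage error.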

In model (15.28) with a single IV for $y_2$, the IV estimator from Section 15-2 is identical to the 
2SLS estimator. Therefore, when we have one IV for each endogenous explanatory variable, we can 
call the estimation method IV or 2SLS. 

Adding more exogenous variables changes very little. For example, suppose the wage equation is 


$$\log(\mathit{wage}) = \beta_0 + \beta_1\,\mathit{educ} + \beta_2\,\mathit{exper} + \beta_3\,\mathit{exper}^2 + u_1,$$ [15.40] 


where $u_1$ is uncorrelated with both exper and exper^2. Suppose that we also think mother’s and father’s 
educations are uncorrelated with $u_1$. Then, we can use both of these as IVs for educ. The reduced form 
equation (or first stage equation) for educ is 


$$\mathit{educ} = \pi_0 + \pi_1\,\mathit{exper} + \pi_2\,\mathit{exper}^2 + \pi_3\,\mathit{motheduc} + \pi_4\,\mathit{fatheduc} + v_2,$$ [15.41] 


and identification requires that $\pi_3 \neq 0$ or $\pi_4 \neq 0$ (or both, of course). 


Return to Education for Working Women 


We estimate equation (15.40) using the data in MROZ. First, we test $H_0$: $\pi_3 = 0$, $\pi_4 = 0$ in (15.41) 
using an F test. The result is F = 124.76, and p-value = .0000. As expected, educ is (partially) 
correlated with parents’ education. 

When we estimate (15.40) by 2SLS, we obtain, in equation form, 


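$$\widehat{\log(\mathit{wage})} = \underset{(.400)}{.048} + \underset{(.031)}{.061}\,\mathit{educ} + \underset{(.013)}{.044}\,\mathit{exper} - \underset{(.0004)}{.0009}\,\mathit{exper}^2$$
$$n = 428,\quad R^2 = .136.$$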


The estimated return to education is about 6.1%, compared with an OLS estimate of about 10.8%. 
Because of its relatively large standard error, the 2SLS estimate is barely statistically significant at the 
5% level against a two-sided alternative. 


The assumptions needed for 2SLS to have the desired large sample properties are given in the 
chapter appendix, but it is useful to briefly summarize them here. If we write the structural equation 
as in (15.28), 


$$y_1 = \beta_0 + \beta_1 y_2 + \beta_2 z_1 + \cdots + \beta_k z_{k-1} + u_1,$$ [15.42] 


then we assume each $z_j$ to be uncorrelated with $u_1$. In addition, we need at least one exogenous 
variable not in (15.42) that is partially correlated with $y_2$. This ensures consistency. For the usual 2SLS 
standard errors and t statistics to be asymptotically valid, we also need a homoskedasticity 
assumption: the variance of the structural error, $u_1$, cannot depend on any of the exogenous variables. For 
time series applications, we need more assumptions, as we will see in Section 15-7. 


15-3b Multicollinearity and 2SLS 


In Chapter 3, we introduced the problem of multicollinearity and showed how correlation among regres- 
sors can lead to large standard errors for the OLS estimates. Multicollinearity can be even more serious 
with 2SLS. To see why, the (asymptotic) variance of the 2SLS estimator of $\beta_1$ can be approximated as 


$$\frac{\sigma^2}{\widehat{\mathrm{SST}}_2\,(1 - \hat{R}_2^2)},$$ [15.43] 


where $\sigma^2 = \mathrm{Var}(u_1)$, $\widehat{\mathrm{SST}}_2$ is the total variation in $\hat{y}_2$, and $\hat{R}_2^2$ is the R-squared from a regression of $\hat{y}_2$ 
on all other exogenous variables appearing in the structural equation. There are two reasons why the 
variance of the 2SLS estimator is larger than that for OLS. First, $\hat{y}_2$, by construction, has less variation 
than $y_2$. (Remember: Total sum of squares = explained sum of squares + residual sum of squares; 
the variation in $y_2$ is the total sum of squares, while the variation in $\hat{y}_2$ is the explained sum of squares 
from the first stage regression.) Second, the correlation between $\hat{y}_2$ and the exogenous variables in 
(15.42) is often much higher than the correlation between $y_2$ and these variables. This essentially 
defines the multicollinearity problem in 2SLS. 
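
The second effect in (15.43) works through the factor $1/(1 - R^2)$, which grows quickly as the R-squared approaches one. The sketch below evaluates that factor at two values, .475 and .995, which are the R-squareds reported for the college proximity illustration in this subsection; the function name is hypothetical, and the calculation deliberately ignores the first effect (the smaller total variation in the fitted values).

```python
def variance_inflation(r_squared):
    """Factor 1/(1 - R^2) by which collinearity scales the variance in (15.43)."""
    if not 0.0 <= r_squared < 1.0:
        raise ValueError("R-squared must lie in [0, 1)")
    return 1.0 / (1.0 - r_squared)


vif_actual = variance_inflation(0.475)   # educ regressed on the exogenous variables
vif_fitted = variance_inflation(0.995)   # first-stage fitted values regressed on them

print(f"factor with R-squared .475: {vif_actual:.2f}")
print(f"factor with R-squared .995: {vif_fitted:.1f}")
print(f"ratio of the two factors:   {vif_fitted / vif_actual:.0f}")

# Ratio of standard errors implied by this factor alone (square root of the variance ratio).
se_ratio_from_collinearity = (vif_fitted / vif_actual) ** 0.5
```

On its own, the collinearity factor implies a roughly tenfold increase in the standard error, before accounting for the reduced variation in the fitted values.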

As an illustration, consider Example 15.4. When educ is regressed on the exogenous variables in
Table 15.1 (not including nearc4), R-squared = .475; this is a moderate degree of multicollinearity,
but the important thing is that the OLS standard error on $\hat{\beta}_{educ}$ is quite small. When we obtain the first
stage fitted values, $\widehat{educ}$, and regress these on the exogenous variables in Table 15.1, R-squared = .995,
which indicates a very high degree of multicollinearity between $\widehat{educ}$ and the remaining exogenous
variables in the table. (This high R-squared is not too surprising because $\widehat{educ}$ is a function of all the
exogenous variables in Table 15.1, plus nearc4.) Equation (15.43) shows that an $\hat{R}_2^2$ close to one can
result in a very large standard error for the 2SLS estimator. But as with OLS, a large sample size can
help offset a large $\hat{R}_2^2$.
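
The two sources of variance inflation in (15.43) can be seen numerically. The following is a minimal sketch on simulated data, not on the data of Example 15.4; the variable names, coefficients, and sample size are hypothetical choices made only for illustration. It compares the denominator in (15.43) with its OLS counterpart, which uses $y_2$ in place of $\hat{y}_2$.

```python
import numpy as np

def fitted(X, y):
    """Fitted values from an OLS regression of y on the columns of X."""
    coef = np.linalg.lstsq(X, y, rcond=None)[0]
    return X @ coef

def sst(y):
    """Total sum of squares about the sample mean."""
    return float(np.sum((y - y.mean()) ** 2))

def r_squared(X, y):
    """R-squared from regressing y on X (X must contain a constant)."""
    resid = y - fitted(X, y)
    return 1.0 - float(resid @ resid) / sst(y)

# Hypothetical data: z1 appears in the structural equation, z2 is the instrument.
rng = np.random.default_rng(0)
n = 1000
z1 = rng.normal(size=n)
z2 = rng.normal(size=n)
y2 = 1.0 + 0.8 * z1 + 0.3 * z2 + rng.normal(size=n)

const = np.ones(n)
y2_hat = fitted(np.column_stack([const, z1, z2]), y2)   # first-stage fitted values
X_incl = np.column_stack([const, z1])                   # other exogenous regressors

sst_y2, sst_hat = sst(y2), sst(y2_hat)
r2_y2, r2_hat = r_squared(X_incl, y2), r_squared(X_incl, y2_hat)

denom_ols = sst_y2 * (1.0 - r2_y2)      # OLS counterpart of the denominator
denom_2sls = sst_hat * (1.0 - r2_hat)   # denominator in (15.43)

print(f"SST: y2 = {sst_y2:.1f}, y2_hat = {sst_hat:.1f}")
print(f"R-squared on z1: y2 = {r2_y2:.3f}, y2_hat = {r2_hat:.3f}")
print(f"denominator ratio (OLS / 2SLS) = {denom_ols / denom_2sls:.1f}")
```

Both effects push the 2SLS denominator below the OLS one, so for a given error variance the approximate 2SLS variance is larger.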


15-3c Detecting Weak Instruments


In Section 15-1 we briefly discussed the problem of weak instruments. We focused on equation 
(15.19), which demonstrates how a small correlation between the instrument and error can lead to 
very large inconsistency (and therefore bias) if the instrument, z, also has little correlation with the 
explanatory variable, x. The same problem can arise in the context of the multiple equation model in 
equation (15.42), whether we have one instrument for $y_2$ or more instruments than we need.

We also mentioned the findings of Staiger and Stock (1997), and we now discuss the practical
implications of this research in a bit more depth. Importantly, Staiger and Stock study the case where
all instrumental variables are exogenous. With the exogeneity requirement satisfied by the instruments,
they focus on the case where the instruments are weakly correlated with $y_2$, and they study the validity
of standard errors, confidence intervals, and t statistics involving the coefficient $\beta_1$ on $y_2$. The
mechanism they used to model weak correlation led to an important finding: even with very large sample
sizes the 2SLS estimator can be biased and can have a distribution that is very different from standard normal.

Building on Staiger and Stock (1997), Stock and Yogo (2005) (SY for short) proposed methods 
for detecting situations where weak instruments will lead to substantial bias and distorted statistical 
inference. Conveniently, Stock and Yogo obtained rules concerning the size of the t statistic (with one
instrument) or the F statistic (with more than one instrument) from the first-stage regression. The
theory is much too involved to pursue here. Instead, we describe some simple rules of thumb proposed
by Stock and Yogo that are easy to implement. 

The key implication of the SY work is that one needs more than just a statistical rejection of the null
hypothesis in the first stage regression at the usual significance levels. For example, in equation (15.6), it
is not enough to reject the null hypothesis stated in (15.7) at the 5% significance level. Using bias
calculations for the instrumental variables estimator, SY recommend that one can proceed with the usual IV
inference if the first-stage t statistic has absolute value larger than $\sqrt{10} \approx 3.2$. Readers will recognize
this value as being well above the 97.5th percentile of the standard normal distribution, 1.96, which is what
we would use for a standard two-sided test at the 5% significance level. This same rule of thumb applies in the
multiple regression model with a single endogenous explanatory variable, $y_2$, and a single instrumental variable, $z_k$.
In particular, the t statistic in testing hypothesis (15.31) should be at least 3.2 in absolute value.

SY cover the case of 2SLS, too. In this case, we must focus on the first-stage F statistic for
exclusion of the instrumental variables for $y_2$, and the SY rule is F > 10. (Notice this is the same rule
based on the t statistic when there is only one instrument, as $t^2 = F$.) For example, consider equation
(15.34), where we have two instruments for $y_2$, $z_2$ and $z_3$. Then the F statistic for the null hypothesis


$H_0: \pi_2 = 0, \pi_3 = 0$


should have F > 10. Remember, this is not the overall F statistic for all of the exogenous variables in
(15.34). We test only the coefficients on the proposed IVs for $y_2$, that is, the exogenous variables that
do not appear in (15.22). In Example 15.5 the relevant F statistic is 124.76, which is well above 10,
implying that we do not have to worry about weak instruments. (Of course, the exogeneity of the
parents’ education variables is in doubt.)

The rule of thumb of requiring the F statistic to be larger than 10 tends to work well and is easy to 
remember. However, like all rules of thumb involving statistical inference, it makes no sense to use 10 
as a knife-edge cutoff. For example, one can probably proceed if F = 9.94, as it is pretty close to 10. 
The rule of thumb should be used as a guideline. SY have more detailed suggestions for cases where 
there are many instruments for $y_2$, say five or more.

A more complicated issue is what happens if there is heteroskedasticity in either the equation of 
interest, (15.28), or the reduced form (first stage) for the endogenous explanatory variables, (15.30). 
Stock and Yogo (2005) did not allow for heteroskedasticity in either equation (or, in a time series or 
panel context, serial correlation). It makes sense that the requirements for the first-stage t or F statistic 
would be more stringent. Work by Olea and Pflueger (2013) suggests this is the case: the first-stage F 
might need to be more like 20 rather than 10 in order to ensure the instruments are sufficiently strong. 
This is an ongoing area of research. 
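
The first-stage F statistic used in the rule of thumb is the usual F statistic for excluding the proposed instruments from the reduced form. The sketch below computes it from restricted and unrestricted sums of squared residuals on simulated data. The function name `first_stage_F`, the coefficients, and the sample size are hypothetical, and the calculation assumes homoskedasticity, so it does not reflect the more stringent requirements discussed for the heteroskedastic case.

```python
import numpy as np

def ssr(X, y):
    """Sum of squared residuals from an OLS regression of y on X."""
    coef = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ coef
    return float(resid @ resid)

def first_stage_F(y2, included, excluded):
    """F statistic for excluding the instruments from the reduced form for y2.

    included: (n, k1) exogenous variables that appear in the structural equation
    excluded: (n, q) proposed instruments that do not appear in it
    A constant is added inside the function.
    """
    n = y2.shape[0]
    X_r = np.hstack([np.ones((n, 1)), included])   # restricted: instruments dropped
    X_ur = np.hstack([X_r, excluded])              # unrestricted reduced form
    q = excluded.shape[1]
    df = n - X_ur.shape[1]
    ssr_r, ssr_ur = ssr(X_r, y2), ssr(X_ur, y2)
    return ((ssr_r - ssr_ur) / q) / (ssr_ur / df)

# Hypothetical data: one included exogenous variable and two proposed instruments.
rng = np.random.default_rng(1)
n = 500
z1 = rng.normal(size=(n, 1))
iv = rng.normal(size=(n, 2))
v2 = rng.normal(size=n)
y2_strong = 0.5 * z1[:, 0] + 0.5 * iv[:, 0] + 0.5 * iv[:, 1] + v2
y2_weak = 0.5 * z1[:, 0] + 0.02 * iv[:, 0] + 0.02 * iv[:, 1] + v2

F_strong = first_stage_F(y2_strong, z1, iv)
F_weak = first_stage_F(y2_weak, z1, iv)
print(f"strong instruments: F = {F_strong:.1f}, F > 10: {F_strong > 10}")
print(f"weak instruments:   F = {F_weak:.1f}, F > 10: {F_weak > 10}")
```

Note that `included` enters both regressions, so the statistic tests only the proposed instruments and is not the overall F statistic of the reduced form.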


15-3d Multiple Endogenous Explanatory Variables


Two stage least squares can also be used in models with more than one endogenous explanatory
variable. For example, consider the model


$y_1 = \beta_0 + \beta_1 y_2 + \beta_2 y_3 + \beta_3 z_1 + \beta_4 z_2 + \beta_5 z_3 + u_1$, [15.44]


where $E(u_1) = 0$ and $u_1$ is uncorrelated with $z_1$, $z_2$, and $z_3$. The variables $y_2$ and $y_3$ are endogenous
explanatory variables: each may be correlated with $u_1$.

To estimate (15.44) by 2SLS, we need at least two exogenous variables that do not appear in
(15.44) but that are correlated with $y_2$ and $y_3$. Suppose we have two excluded exogenous variables,
say $z_4$ and $z_5$. Then, from our analysis of a single endogenous explanatory variable, we need either
$z_4$ or $z_5$ to appear in each reduced form for $y_2$ and $y_3$. (As before, we can use F statistics to test
this.) Although this is necessary for identification, unfortunately, it is not sufficient. Suppose that $z_4$
appears in each reduced form, but $z_5$ appears in neither. Then, we do not really have two exogenous
variables partially correlated with $y_2$ and $y_3$. Two stage least squares will not produce consistent
estimators of the $\beta_j$.

Generally, when we have more than one endogenous explanatory variable in a regression model, 
identification can fail in several complicated ways. But we can easily state a necessary condition for 
identification, which is called the order condition. 


Order Condition for Identification of an Equation. We need at least as many excluded exogenous
variables as there are included endogenous explanatory variables in the structural equation.

The order condition is simple to check, as it only involves counting endogenous and exogenous
variables. The sufficient condition for identification is called the rank condition. We have seen
special cases of the rank condition before, for example, in the discussion surrounding
equation (15.35). A general statement of the rank condition requires matrix algebra and is beyond the
scope of this text. [See Wooldridge (2010, Chapter 5).] It is even more difficult to obtain diagnostics
for weak instruments.
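
The difference between the order condition and the rank condition can be illustrated with the case described above for (15.44), where $z_4$ appears in both reduced forms and $z_5$ in neither. The sketch below is only an illustration under stated assumptions: the reduced-form coefficients are hypothetical population values, and the rank check uses one common matrix version of the rank condition (the matrix of reduced-form coefficients on the excluded exogenous variables must have rank equal to the number of endogenous explanatory variables), which the text itself does not state.

```python
import numpy as np

def order_condition_holds(n_endog, n_excluded_exog):
    """Necessary condition: at least as many excluded exogenous variables
    as included endogenous explanatory variables."""
    return n_excluded_exog >= n_endog

# Hypothetical population reduced-form coefficients on the excluded variables.
# Rows: y2, y3. Columns: z4, z5.
Pi_fail = np.array([[1.0, 0.0],    # z5 appears in neither reduced form
                    [0.5, 0.0]])
Pi_ok = np.array([[1.0, 0.0],      # z5 matters for y3
                  [0.5, 0.7]])

order_ok = order_condition_holds(n_endog=2, n_excluded_exog=2)
rank_fail = int(np.linalg.matrix_rank(Pi_fail))
rank_ok = int(np.linalg.matrix_rank(Pi_ok))

print("order condition holds:", order_ok)
print("rank when z5 is irrelevant:", rank_fail, "(needs 2)")
print("rank when z5 matters for y3:", rank_ok, "(needs 2)")
```

In the first case the count is right but the rank falls short, which matches the statement that the order condition is necessary but not sufficient.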


GOING FURTHER 15.3


The following model explains violent crime rates, at the city level, in terms of a binary
variable for whether gun control laws exist and other controls:


$violent = \beta_0 + \beta_1 guncontrol + \beta_2 unem + \beta_3 popul + \beta_4 percblck + \beta_5 age18\_21 + \cdots$


Some researchers have estimated similar equations using variables such as the number of National
Rifle Association members in the city and the number of subscribers to gun magazines as instrumental
variables for guncontrol [see, for example, Kleck and Patterson (1993)]. Are these convincing instruments?


15-3e Testing Multiple Hypotheses after 2SLS Estimation 


We must be careful when testing multiple hypotheses in a model estimated by 2SLS. It is tempting
to use either the sum of squared residuals or the R-squared form of the F statistic, as we learned with
OLS in Chapter 4. The fact that the R-squared in 2SLS can be negative suggests that the usual way of
computing F statistics might not be appropriate; this is the case. In fact, if we use the 2SLS
residuals to compute the SSRs for both the restricted and unrestricted models, there is no guarantee that
$SSR_r \ge SSR_{ur}$; if the reverse is true, the F statistic would be negative.

It is possible to combine the sum of squared residuals from the second stage regression [such 
as (15.38)] with $SSR_{ur}$ to obtain a statistic with an approximate F distribution in large samples.
Because many econometrics packages have simple-to-use test commands that can be used to 
test multiple hypotheses after 2SLS estimation, we omit the details. Davidson and MacKinnon 
(1993) and Wooldridge (2010, Chapter 5) contain discussions of how to compute F-type statistics 
for 2SLS. 


15-4 IV Solutions to Errors-in-Variables Problems


In the previous sections, we presented the use of instrumental variables as a way to solve the omitted
variables problem, but they can also be used to deal with the measurement error problem. As an
illustration, consider the model


$y = \beta_0 + \beta_1 x_1^* + \beta_2 x_2 + u$, [15.45]


where $y$ and $x_2$ are observed but $x_1^*$ is not. Let $x_1$ be an observed measurement of $x_1^*$: $x_1 = x_1^* + e_1$,
where $e_1$ is the measurement error. In Chapter 9, we showed that correlation between $x_1$ and $e_1$ causes
OLS, where $x_1$ is used in place of $x_1^*$, to be biased and inconsistent. We can see this by writing


$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + (u - \beta_1 e_1)$. [15.46]


If the classical errors-in-variables (CEV) assumptions hold, the bias in the OLS estimator of $\beta_1$ is
toward zero. Without further assumptions, we can do nothing about this.

In some cases, we can use an IV procedure to solve the measurement error problem. In (15.45),
we assume that $u$ is uncorrelated with $x_1^*$, $x_1$, and $x_2$; in the CEV case, we assume that $e_1$ is
uncorrelated with $x_1^*$ and $x_2$. These imply that $x_2$ is exogenous in (15.46), but that $x_1$ is correlated with $e_1$.
What we need is an IV for $x_1$. Such an IV must be correlated with $x_1$, uncorrelated with $u$ (so that it
can be excluded from (15.45)), and uncorrelated with the measurement error, $e_1$.

One possibility is to obtain a second measurement on $x_1^*$, say, $z_1$. Because it is $x_1^*$ that affects $y$, it
is only natural to assume that $z_1$ is uncorrelated with $u$. If we write $z_1 = x_1^* + a_1$, where $a_1$ is the
measurement error in $z_1$, then we must assume that $a_1$ and $e_1$ are uncorrelated. In other words, $x_1$ and $z_1$
both mismeasure $x_1^*$, but their measurement errors are uncorrelated. Certainly, $x_1$ and $z_1$ are correlated
through their dependence on $x_1^*$, so we can use $z_1$ as an IV for $x_1$.
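
A small simulation can make the role of the second measurement concrete. The sketch below is illustrative only: the data are simulated under the CEV assumptions with hypothetical parameter values (true $\beta_1 = 2$, unit variances), and the helper `tsls` is a bare-bones 2SLS routine written for this example rather than any package's implementation.

```python
import numpy as np

def ols(X, y):
    """OLS coefficients from regressing y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def tsls(X, Z, y):
    """2SLS coefficients: regress y on the first-stage fitted values of X."""
    X_hat = Z @ ols(Z, X)
    return ols(X_hat, y)

rng = np.random.default_rng(2)
n = 20000
x1_star = rng.normal(size=n)            # unobserved true variable
x2 = rng.normal(size=n)                 # correctly measured regressor
u = rng.normal(size=n)
y = 1.0 + 2.0 * x1_star + 1.0 * x2 + u

x1 = x1_star + rng.normal(size=n)       # first measurement, error e1
z1 = x1_star + rng.normal(size=n)       # second measurement, error a1 independent of e1

const = np.ones(n)
X = np.column_stack([const, x1, x2])    # mismeasured regressor used in place of x1_star
Z = np.column_stack([const, z1, x2])    # z1 instruments x1; x2 is its own instrument

b_ols = ols(X, y)
b_iv = tsls(X, Z, y)
print(f"OLS estimate of beta_1: {b_ols[1]:.3f} (attenuated toward zero)")
print(f"IV estimate of beta_1:  {b_iv[1]:.3f} (true value is 2)")
```

With these hypothetical variances, half of the variation in $x_1$ is measurement error, so the OLS coefficient is pulled to roughly half the true value, while the IV estimate stays close to 2.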

Where might we get two measurements on a variable? Sometimes, when a group of workers is asked 
for their annual salary, their employers can provide a second measure. For married couples, each spouse 
can independently report the level of savings or family income. In the Ashenfelter and Krueger (1994) 
study cited in Section 14-3, each twin was asked about his or her sibling’s years of education; this gives 
a second measure that can be used as an IV for self-reported education in a wage equation. (Ashenfelter 
and Krueger combined differencing and IV to account for the omitted ability problem as well; more on 
this in Section 15-8.) Generally, though, having two measures of an explanatory variable is rare. 

An alternative is to use other exogenous variables as IVs for a potentially mismeasured variable.
For example, our use of motheduc and fatheduc as IVs for educ in Example 15.5 can serve this
purpose. If we think that $educ = educ^* + e_1$, then the IV estimates in Example 15.5 do not suffer from
measurement error if motheduc and fatheduc are uncorrelated with the measurement error, $e_1$. This is
probably more reasonable than assuming motheduc and fatheduc are uncorrelated with ability, which
is contained in $u$ in (15.45).

IV methods can also be adopted when using things like test scores to control for unobserved 
characteristics. In Section 9-2, we showed that, under certain assumptions, proxy variables can be 
used to solve the omitted variables problem. In Example 9.3, we used IQ as a proxy variable for 
unobserved ability. This simply entails adding IQ to the model and performing an OLS regression. 
But there is an alternative that works when IQ does not fully satisfy the proxy variable assumptions. 
To illustrate, write a wage equation as 


$\log(wage) = \beta_0 + \beta_1 educ + \beta_2 exper + \beta_3 exper^2 + abil + u$, [15.47]


where we again have the omitted ability problem. But we have two test scores that are indicators of
ability. We assume that the scores can be written as


$test_1 = \gamma_1 abil + e_1$

and

$test_2 = \delta_1 abil + e_2$,


where $\gamma_1 > 0$, $\delta_1 > 0$. Since it is ability that affects wage, we can assume that $test_1$ and $test_2$
are uncorrelated with $u$. If we write abil in terms of the first test score, $abil = (test_1 - e_1)/\gamma_1$, and plug
the result into (15.47), we get


$\log(wage) = \beta_0 + \beta_1 educ + \beta_2 exper + \beta_3 exper^2 + \alpha_1 test_1 + (u - \alpha_1 e_1)$, [15.48]


where $\alpha_1 = 1/\gamma_1$. Now, if we assume that $e_1$ is uncorrelated with all the explanatory variables in
(15.47), including abil, then $e_1$ and $test_1$ must be correlated. [Notice that educ is not endogenous
in (15.48); however, $test_1$ is.] This means that estimating (15.48) by OLS will produce inconsistent
estimators of the $\beta_j$ (and $\alpha_1$). Under the assumptions we have made, $test_1$ does not satisfy the proxy
variable assumptions.

If we assume that $e_2$ is also uncorrelated with all the explanatory variables in (15.47) and that $e_1$
and $e_2$ are uncorrelated, then $e_1$ is uncorrelated with the second test score, $test_2$. Therefore, $test_2$ can be
used as an IV for $test_1$.


EXAMPLE 15.6 Using Two Test Scores as Indicators of Ability


We use the data in WAGE2 to implement the preceding procedure, where IQ plays the role of the first
test score and KWW (knowledge of the world of work) is the second test score. The explanatory
variables are the same as in Example 9.3: educ, exper, tenure, married, south, urban, and black. Rather
than adding IQ and doing OLS, as in column (2) of Table 9.2, we add IQ and use KWW as its
instrument. The coefficient on educ is .025 (se = .017). This is a low estimate, and it is not statistically
different from zero. This is a puzzling finding, and it suggests that one of our assumptions fails; perhaps
$e_1$ and $e_2$ are correlated.


15-5 Testing for Endogeneity and Testing Overidentifying Restrictions 


In this section, we describe two important tests in the context of instrumental variables estimation. 


15-5a Testing for Endogeneity 


The 2SLS estimator is less efficient than OLS when the explanatory variables are exogenous; as we 
have seen, the 2SLS estimates can have very large standard errors. Therefore, it is useful to have a 
test for endogeneity of an explanatory variable that shows whether 2SLS is even necessary. Obtaining 
such a test is rather simple. 

To illustrate, suppose we have a single suspected endogenous variable,


$y_1 = \beta_0 + \beta_1 y_2 + \beta_2 z_1 + \beta_3 z_2 + u_1$, [15.49]


where $z_1$ and $z_2$ are exogenous. We have two additional exogenous variables, $z_3$ and $z_4$, which do not
appear in (15.49). If $y_2$ is uncorrelated with $u_1$, we should estimate (15.49) by OLS. How can we test
this? Hausman (1978) suggested directly comparing the OLS and 2SLS estimates and determining
whether the differences are statistically significant. After all, both OLS and 2SLS are consistent if all
variables are exogenous. If 2SLS and OLS differ significantly, we conclude that $y_2$ must be
endogenous (maintaining that the $z_j$ are exogenous).

It is a good idea to compute OLS and 2SLS to see if the estimates are practically different. To
determine whether the differences are statistically significant, it is easier to use a regression test. This
is based on estimating the reduced form for $y_2$, which in this case is


$y_2 = \pi_0 + \pi_1 z_1 + \pi_2 z_2 + \pi_3 z_3 + \pi_4 z_4 + v_2$. [15.50]


Now, since each $z_j$ is uncorrelated with $u_1$, $y_2$ is uncorrelated with $u_1$ if, and only if, $v_2$ is uncorrelated
with $u_1$; this is what we wish to test. Write $u_1 = \delta_1 v_2 + e_1$, where $e_1$ is uncorrelated with $v_2$ and has
zero mean. Then, $u_1$ and $v_2$ are uncorrelated if, and only if, $\delta_1 = 0$. The easiest way to test this is
to include $v_2$ as an additional regressor in (15.49) and to do a t test. There is only one problem with
implementing this: $v_2$ is not observed, because it is the error term in (15.50). Because we can
estimate the reduced form for $y_2$ by OLS, we can obtain the reduced form residuals, $\hat{v}_2$. Therefore, we
estimate


$y_1 = \beta_0 + \beta_1 y_2 + \beta_2 z_1 + \beta_3 z_2 + \delta_1 \hat{v}_2 + error$ [15.51]


by OLS and test $H_0: \delta_1 = 0$ using a t statistic. If we reject $H_0$ at a small significance level, we
conclude that $y_2$ is endogenous because $v_2$ and $u_1$ are correlated.


Testing for Endogeneity of a Single Explanatory Variable: 

(i) Estimate the reduced form for $y_2$ by regressing it on all exogenous variables (including those
in the structural equation and the additional IVs). Obtain the residuals, $\hat{v}_2$.

(ii) Add $\hat{v}_2$ to the structural equation (which includes $y_2$) and test for significance of $\hat{v}_2$ using
an OLS regression. If the coefficient on $\hat{v}_2$ is statistically different from zero, we conclude that $y_2$ is
indeed endogenous. We might want to use a heteroskedasticity-robust t test.
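
The two steps can be sketched in a few lines. The example below uses simulated data in which $y_2$ is endogenous by construction (the hypothetical value $\delta_1 = 0.5$ links the structural error to the reduced form error), and it uses the usual homoskedasticity-based OLS standard errors rather than a robust version. All names and parameter values are assumptions made for illustration.

```python
import numpy as np

def ols_with_se(X, y):
    """OLS coefficients, usual (nonrobust) standard errors, and residuals."""
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    resid = y - X @ b
    s2 = float(resid @ resid) / (n - k)
    return b, np.sqrt(s2 * np.diag(XtX_inv)), resid

rng = np.random.default_rng(3)
n = 2000
z = rng.normal(size=(n, 4))              # columns: z1, z2 (included), z3, z4 (extra IVs)
v2 = rng.normal(size=n)
u1 = 0.5 * v2 + rng.normal(size=n)       # delta_1 = 0.5, so y2 is endogenous
y2 = 0.5 * z.sum(axis=1) + v2            # reduced form, as in (15.50)
y1 = 1.0 + 1.0 * y2 + 0.5 * z[:, 0] - 0.5 * z[:, 1] + u1

const = np.ones(n)

# Step (i): reduced form for y2 on all exogenous variables; keep the residuals.
_, _, v2_hat = ols_with_se(np.column_stack([const, z]), y2)

# Step (ii): add the residuals to the structural equation and use a t test.
X_aug = np.column_stack([const, y2, z[:, 0], z[:, 1], v2_hat])
b, se, _ = ols_with_se(X_aug, y1)
delta_hat, t_delta = b[-1], b[-1] / se[-1]
print(f"delta_1 estimate = {delta_hat:.3f}, t statistic = {t_delta:.2f}")
```

Here the t statistic is far from zero, so the null hypothesis that $y_2$ is exogenous would be rejected, as it should be for this design.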


EXAMPLE 15.7 Return to Education for Working Women


We can test for endogeneity of educ in (15.40) by obtaining the residuals $\hat{v}_2$ from estimating the
reduced form (15.41), using only working women, and including these in (15.40). When we do
this, the coefficient on $\hat{v}_2$ is $\hat{\delta}_1 = .058$, and t = 1.67. This is moderate evidence of positive correlation
between $u_1$ and $v_2$. It is probably a good idea to report both estimates because the 2SLS estimate of
the return to education (6.1%) is well below the OLS estimate (10.8%).


An interesting feature of the regression from step (ii) of the test for endogeneity is that the
coefficient estimates on all explanatory variables (except, of course, $\hat{v}_2$) are identical to the 2SLS estimates. For
example, estimating (15.51) by OLS produces the same $\hat{\beta}_1$ as estimating (15.49) by 2SLS. One benefit
of this equivalence is that it provides an easy check on whether you have done the proper regression in
testing for endogeneity. But it also gives a different, useful interpretation of 2SLS: adding $\hat{v}_2$ to the
original equation as an explanatory variable, and applying OLS, clears up the endogeneity of $y_2$. So, when
we start by estimating (15.49) by OLS, we can quantify the importance of allowing $y_2$ to be endogenous
by seeing how much $\hat{\beta}_1$ changes when $\hat{v}_2$ is added to the equation. Irrespective of the outcome of the
statistical tests, we can see whether the change in $\hat{\beta}_1$ is expected and is practically significant.
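
The equivalence between the augmented OLS regression and 2SLS is an algebraic fact, so it can be checked on any data set. The following sketch does so on simulated data with hypothetical parameter values; the 2SLS estimates are computed by the textbook two-step route of regressing $y_1$ on the first-stage fitted values.

```python
import numpy as np

def ols(X, y):
    """OLS coefficients from regressing y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

rng = np.random.default_rng(4)
n = 1000
const = np.ones(n)
z = rng.normal(size=(n, 4))              # z1, z2 included; z3, z4 excluded
v2 = rng.normal(size=n)
u1 = 0.5 * v2 + rng.normal(size=n)       # makes y2 endogenous
y2 = 0.5 * z.sum(axis=1) + v2
y1 = 1.0 + 1.0 * y2 + 0.5 * z[:, 0] - 0.5 * z[:, 1] + u1

X = np.column_stack([const, y2, z[:, 0], z[:, 1]])   # structural regressors
Z = np.column_stack([const, z])                      # all exogenous variables

y2_hat = Z @ ols(Z, y2)                  # first-stage fitted values
v2_hat = y2 - y2_hat                     # reduced form residuals

b_ols = ols(X, y1)
b_2sls = ols(np.column_stack([const, y2_hat, z[:, 0], z[:, 1]]), y1)
b_aug = ols(np.column_stack([X, v2_hat]), y1)        # OLS with v2_hat added

print("OLS coefficient on y2:          ", round(float(b_ols[1]), 4))
print("2SLS coefficient on y2:         ", round(float(b_2sls[1]), 4))
print("augmented-OLS coefficient on y2:", round(float(b_aug[1]), 4))
```

The last two numbers agree up to rounding error, while the gap between the first and second shows how much the estimate moves once $y_2$ is allowed to be endogenous. As noted in the text, standard errors should still come from a 2SLS routine.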

If, in the end, the 2SLS estimates are chosen, one should obtain the standard errors using built-in 
2SLS routines rather than those from regression (15.51). The standard errors obtained from the OLS 
regression (15.51) are valid only under the null hypothesis $\delta_1 = 0$.

We can also test for endogeneity of multiple explanatory variables. For each suspected
endogenous variable, we obtain the reduced form residuals, as in part (i). Then, we test for joint significance
of these residuals in the structural equation, using an F test. Joint significance indicates that at least 
one suspected explanatory variable is endogenous. The number of exclusion restrictions tested is the 
number of suspected endogenous explanatory variables. 


15-5b Testing Overidentification Restrictions 


When we introduced the simple instrumental variables estimator in Section 15-1, we emphasized that
the instrument must satisfy two requirements: it must be uncorrelated with the error (exogeneity) and
correlated with the endogenous explanatory variable (relevance). We have now seen that, even in models
with additional explanatory variables, the second requirement can be tested using a t test (with just one
instrument) or an F test (when there are multiple instruments). In the context of the simple IV estimator,
we noted that the exogeneity requirement cannot be tested. However, if we have more instruments than
we need, we can effectively test whether some of them are uncorrelated with the structural error.

As a specific example, again consider equation (15.49) with two instrumental variables for $y_2$, $z_3$
and $z_4$. Remember, $z_1$ and $z_2$ essentially act as their own instruments. Because we have two
instruments for $y_2$, we can estimate (15.49) using, say, only $z_3$ as an IV for $y_2$; let $\check{\beta}_1$ be the resulting IV
estimator of $\beta_1$. Then, we can estimate (15.49) using only $z_4$ as an IV for $y_2$; call this IV estimator $\tilde{\beta}_1$.
If all $z_j$ are exogenous, and if $z_3$ and $z_4$ are each partially correlated with $y_2$, then $\check{\beta}_1$ and $\tilde{\beta}_1$ are both
consistent for $\beta_1$. Therefore, if our logic for choosing the instruments is sound, $\check{\beta}_1$ and $\tilde{\beta}_1$ should differ
only by sampling error. Hausman (1978) proposed basing a test of whether $z_3$ and $z_4$ are both
exogenous on the difference, $\check{\beta}_1 - \tilde{\beta}_1$. Shortly, we will provide a simpler way to obtain a valid test, but,
before doing so, we should understand how to interpret the outcome of the test.

If we conclude that $\check{\beta}_1$ and $\tilde{\beta}_1$ are statistically different from one another, then we have no choice
but to conclude that either $z_3$, $z_4$, or both fail the exogeneity requirement. Unfortunately, we cannot
know which is the case (unless we simply assert from the beginning that, say, $z_3$ is exogenous). For
example, if $y_2$ denotes years of schooling in a log wage equation, $z_3$ is mother’s education, and $z_4$ is
father’s education, a statistically significant difference in the two IV estimators implies that one or
both of the parents’ education variables are correlated with $u_1$ in (15.49).

Certainly, rejecting that one’s instruments are exogenous is serious and requires a new approach.
But the more serious, and subtle, problem in comparing IV estimates is that they may be similar even
though both instruments fail the exogeneity requirement. In the previous example, it seems likely
that if mother’s education is positively correlated with $u_1$, then so is father’s education. Therefore, the
two IV estimates may be similar even though each is inconsistent. In effect, because the IVs in this
example are chosen using similar reasoning, their separate use in IV procedures may very well lead
to similar estimates that are nevertheless both inconsistent. The point is that we should not feel
especially comfortable if our IV procedures pass the Hausman test.

Another problem with comparing two IV estimates is that often they may seem practically
different yet, statistically, we cannot reject the null hypothesis that they are consistent for the same
population parameter. For example, in estimating (15.40) by IV using motheduc as the only
instrument, the coefficient on educ is .049 (.037). If we use only fatheduc as the IV for educ, the coefficient
on educ is .070 (.034). [Perhaps not surprisingly, the estimate using both parents’ education as IVs is
in between these two, .061 (.031).] For policy purposes, the difference between 5% and 7% for the
estimated return to a year of schooling is substantial. Yet, as shown in Example 15.8, the difference is
not statistically significant.

The procedure of comparing different IV estimates of the same parameter is an example of
testing overidentifying restrictions. The general idea is that we have more instruments than we need to
estimate the parameters consistently. In the previous example, we had one more instrument than we
need, and this results in one overidentifying restriction that can be tested. In the general case,
suppose that we have q more instruments than we need. For example, with one endogenous explanatory
variable, $y_2$, and three proposed instruments for $y_2$, we have q = 3 - 1 = 2 overidentifying
restrictions. When q is two or more, comparing several IV estimates is cumbersome. Instead, we can easily
compute a test statistic based on the 2SLS residuals. The idea is that, if all instruments are
exogenous, the 2SLS residuals should be uncorrelated with the instruments, up to sampling error. But if
there are k + 1 parameters and k + 1 + q instruments, the 2SLS residuals have a zero mean and are
identically uncorrelated with k linear combinations of the instruments. (This algebraic fact contains,
as a special case, the fact that the OLS residuals have a zero mean and are uncorrelated with the k
explanatory variables.) Therefore, the test checks whether the 2SLS residuals are correlated with q
linear functions of the instruments, and we need not decide on the functions; the test does that for us
automatically.

The following regression-based test is valid when the homoskedasticity assumption, listed as
Assumption 2SLS.5 in the chapter appendix, holds.


Testing Overidentifying Restrictions: 

(i) Estimate the structural equation by 2SLS and obtain the 2SLS residuals, $\hat{u}_1$.

(ii) Regress $\hat{u}_1$ on all exogenous variables. Obtain the R-squared, say, $R_1^2$.

(iii) Under the null hypothesis that all IVs are uncorrelated with $u_1$, $nR_1^2 \overset{a}{\sim} \chi_q^2$, where q is the
number of instrumental variables from outside the model minus the total number of endogenous
explanatory variables. If $nR_1^2$ exceeds (say) the 5% critical value in the $\chi_q^2$ distribution, we reject $H_0$
and conclude that at least some of the IVs are not exogenous.
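
The three steps translate directly into code. The sketch below applies them to simulated data with one endogenous variable and two outside instruments, so q = 1. The function `overid_stat`, the parameter values, and the design in which one instrument is deliberately made invalid are all hypothetical, and the statistic relies on the homoskedasticity assumption just as the procedure above does.

```python
import numpy as np

def ols(X, y):
    """OLS coefficients from regressing y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def overid_stat(X, Z, y):
    """n times the R-squared from regressing the 2SLS residuals on all exogenous variables.

    X: regressors of the structural equation (constant included)
    Z: all exogenous variables, included and excluded (constant included)
    """
    X_hat = Z @ ols(Z, X)                 # first-stage fitted values
    b = ols(X_hat, y)                     # 2SLS coefficients
    u_hat = y - X @ b                     # step (i): residuals use the actual X
    e = u_hat - Z @ ols(Z, u_hat)         # step (ii): regress u_hat on all of Z
    r2 = 1.0 - float(e @ e) / float(np.sum((u_hat - u_hat.mean()) ** 2))
    return len(y) * r2                    # step (iii): compare with chi-square(q)

rng = np.random.default_rng(5)
n = 2000
const = np.ones(n)
z1, z3, z4 = rng.normal(size=(3, n))
v2 = rng.normal(size=n)
y2 = 0.5 * z1 + 0.5 * z3 + 0.5 * z4 + v2
X = np.column_stack([const, y2, z1])
Z = np.column_stack([const, z1, z3, z4])

u_valid = 0.5 * v2 + rng.normal(size=n)              # both instruments exogenous
u_bad = 0.5 * v2 + 0.5 * z4 + rng.normal(size=n)     # z4 correlated with the error
stat_valid = overid_stat(X, Z, 1.0 + y2 + z1 + u_valid)
stat_bad = overid_stat(X, Z, 1.0 + y2 + z1 + u_bad)
stat_just = overid_stat(X, Z[:, :3], 1.0 + y2 + z1 + u_valid)  # z3 only: just identified

CRIT_5PCT_1DF = 3.84   # 5% critical value of the chi-square distribution with 1 df
print(f"valid instruments:   nR2 = {stat_valid:.2f}")
print(f"one invalid IV:      nR2 = {stat_bad:.2f} (critical value {CRIT_5PCT_1DF})")
print(f"just identified:     nR2 = {stat_just:.2e}")
```

The just identified case returns a statistic that is zero up to rounding error, which illustrates why the test is uninformative without overidentifying restrictions.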


EXAMPLE 15.8 Return to Education for Working Women 


When we use motheduc and fatheduc as IVs for educ in (15.40), we have a single overidentifying
restriction. Regressing the 2SLS residuals $\hat{u}_1$ on exper, $exper^2$, motheduc, and fatheduc produces
$R_1^2 = .0009$. Therefore, $nR_1^2 = 428(.0009) = .3852$, which is a very small value in a $\chi_1^2$ distribution
(p-value = .535). Therefore, the parents’ education variables pass the overidentification test. When
we add husband’s education to the IV list, we get two overidentifying restrictions, and $nR_1^2 = 1.11$
(p-value = .574). Subject to the preceding cautions, it seems reasonable to add huseduc to the IV
list, as this reduces the standard error of the 2SLS estimate: the 2SLS estimate on educ using all three
instruments is .080 (se = .022), so this makes educ much more significant than when huseduc is not
used as an IV ($\hat{\beta}_{educ} = .061$, se = .031).
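
The arithmetic in this example can be reproduced from the reported numbers alone. The sketch below uses only the sample size, R-squared, and test statistic quoted above, together with two closed-form expressions for chi-square tail probabilities that hold for one and two degrees of freedom; the function names are hypothetical.

```python
import math

def chi2_pvalue_1df(x):
    """P(chi-square with 1 df > x), using the identity with the normal distribution."""
    return math.erfc(math.sqrt(x / 2.0))

def chi2_pvalue_2df(x):
    """P(chi-square with 2 df > x), which equals exp(-x / 2)."""
    return math.exp(-x / 2.0)

n = 428
r2 = 0.0009

stat_two_ivs = n * r2                         # motheduc and fatheduc: q = 1
p_two_ivs = chi2_pvalue_1df(stat_two_ivs)

stat_three_ivs = 1.11                         # reported after adding huseduc: q = 2
p_three_ivs = chi2_pvalue_2df(stat_three_ivs)

print(f"q = 1: nR2 = {stat_two_ivs:.4f}, p-value = {p_two_ivs:.3f}")
print(f"q = 2: nR2 = {stat_three_ivs:.2f}, p-value = {p_three_ivs:.3f}")
```

Both p-values match the ones reported in the example to three decimal places.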


When q = 1, a natural question is: How does the test obtained from the regression-based
procedure compare with a test based on directly comparing the estimates? In fact, the two procedures are
asymptotically the same. As a practical matter, it makes sense to compute the two IV estimates to see
how they differ. More generally, when $q \ge 2$, one can compare the 2SLS estimates using all IVs to
the IV estimates using single instruments. By doing so, one can see if the various IV estimates are
practically different, whether or not the overidentification test rejects.

In the previous example, we alluded to a general fact about 2SLS: under the standard 2SLS
assumptions, adding instruments to the list improves the asymptotic efficiency of the 2SLS. But this requires that
any new instruments are in fact exogenous (otherwise, 2SLS will not even be consistent), and it is only an
asymptotic result. With the typical sample sizes available, adding too many instruments (that is,
increasing the number of overidentifying restrictions) can cause severe biases in 2SLS. A detailed discussion
would take us too far afield. A nice illustration is given by Bound, Jaeger, and Baker (1995), who argue
that the 2SLS estimates of the return to education obtained by Angrist and Krueger (1991), using many
instrumental variables, are likely to be seriously biased (even with hundreds of thousands of observations!).

The overidentification test can be used whenever we have more instruments than we need. If we have
just enough instruments, the model is said to be just identified, and the R-squared in part (ii) will be
identically zero. As we mentioned earlier, we cannot test exogeneity of the instruments in the just identified case.

The test can be made robust to heteroskedasticity of arbitrary form; for details, see Wooldridge 
(2010, Chapter 5). 


15-6 2SLS with Heteroskedasticity 


Heteroskedasticity in the context of 2SLS raises essentially the same issues as with OLS. Most
importantly, it is possible to obtain standard errors and test statistics that are (asymptotically) robust to
heteroskedasticity of arbitrary and unknown form. In fact, expression (8.4) continues to be valid if the
$\hat{r}_{ij}$ are obtained as the residuals from regressing $\hat{x}_{ij}$ on the other $\hat{x}_{ih}$, where the “^” denotes fitted values
from the first stage regressions (for endogenous explanatory variables). Wooldridge (2010, Chapter 5)
contains more details. Some software packages do this routinely.

We can also test for heteroskedasticity, using an analog of the Breusch-Pagan test that we covered
in Chapter 8. Let $\hat{u}$ denote the 2SLS residuals and let $z_1, z_2, \ldots, z_m$ denote all the exogenous
variables (including those used as IVs for the endogenous explanatory variables). Then, under reasonable
assumptions [spelled out, for example, in Wooldridge (2010, Chapter 5)], an asymptotically valid
statistic is the usual F statistic for joint significance in a regression of $\hat{u}^2$ on $z_1, z_2, \ldots, z_m$. The null
hypothesis of homoskedasticity is rejected if the $z_j$ are jointly significant.

If we apply this test to Example 15.8, using motheduc, fatheduc, and huseduc as instruments for
educ, we obtain $F_{5,422} = 2.53$ and p-value = .029. This is evidence of heteroskedasticity at the 5%
level. We might want to compute heteroskedasticity-robust standard errors to account for this.
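
The mechanics of this test are simple: square the 2SLS residuals and compute the usual F statistic for joint significance of all exogenous variables. The sketch below does this on simulated data, not on the data of Example 15.8. The design, in which the error variance rises with one exogenous variable, and all names and parameter values are hypothetical.

```python
import numpy as np

def ols(X, y):
    """OLS coefficients from regressing y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def tsls_resid(X, Z, y):
    """2SLS residuals, computed with the actual regressors X."""
    X_hat = Z @ ols(Z, X)
    return y - X @ ols(X_hat, y)

def joint_F(Z, y):
    """Usual F statistic for joint significance of every non-constant column of Z.
    The first column of Z must be the constant."""
    n, k = Z.shape
    e = y - Z @ ols(Z, y)
    ssr_ur = float(e @ e)
    ssr_r = float(np.sum((y - y.mean()) ** 2))    # regression on a constant only
    return ((ssr_r - ssr_ur) / (k - 1)) / (ssr_ur / (n - k))

rng = np.random.default_rng(6)
n = 2000
const = np.ones(n)
z1, z2, z3 = rng.normal(size=(3, n))
v2 = rng.normal(size=n)
y2 = 0.5 * z1 + 0.5 * z2 + 0.5 * z3 + v2
# Hypothetical heteroskedastic error: Var(u | z) is proportional to exp(z1).
u = np.exp(0.5 * z1) * (0.5 * v2 + rng.normal(size=n))
y1 = 1.0 + 1.0 * y2 + 0.5 * z1 + u

X = np.column_stack([const, y2, z1])          # structural regressors
Z = np.column_stack([const, z1, z2, z3])      # all exogenous variables

u_hat = tsls_resid(X, Z, y1)
F_het = joint_F(Z, u_hat ** 2)                # regress squared residuals on z1, z2, z3
print(f"F statistic = {F_het:.1f} with ({Z.shape[1] - 1}, {n - Z.shape[1]}) df")
```

With three numerator degrees of freedom and a large sample, the 5% critical value is about 2.60, so a statistic of this size rejects homoskedasticity.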

If we know how the error variance depends on the exogenous variables, we can use a
weighted 2SLS procedure, essentially the same as in Section 8-4. After estimating a model for
$\mathrm{Var}(u|z_1, z_2, \ldots, z_m)$, we divide the dependent variable, the explanatory variables, and all the
instrumental variables for observation i by $\sqrt{\hat{h}_i}$, where $\hat{h}_i$ denotes the estimated variance. (The constant,
which is both an explanatory variable and an IV, is divided by $\sqrt{\hat{h}_i}$; see Section 8-4.) Then, we apply
2SLS on the transformed equation using the transformed instruments.
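
The transformation can be sketched as follows. To keep the example short, the variance function is treated as known rather than estimated, which is an assumption of this sketch and not part of the procedure described above; in practice $\hat{h}_i$ would come from a fitted variance model. The data, names, and parameter values are hypothetical.

```python
import numpy as np

def ols(X, y):
    """OLS coefficients from regressing y on the columns of X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

def tsls(X, Z, y):
    """2SLS coefficients: regress y on the first-stage fitted values of X."""
    X_hat = Z @ ols(Z, X)
    return ols(X_hat, y)

def weighted_tsls(X, Z, y, h):
    """2SLS after dividing y, every column of X, and every column of Z by sqrt(h).
    The constant columns in X and Z are divided as well."""
    w = 1.0 / np.sqrt(h)
    return tsls(X * w[:, None], Z * w[:, None], y * w)

rng = np.random.default_rng(7)
n = 5000
const = np.ones(n)
z1, z2, z3 = rng.normal(size=(3, n))
v2 = rng.normal(size=n)
y2 = 0.5 * z1 + 0.5 * z2 + 0.5 * z3 + v2
h = np.exp(z1)                                   # variance function, assumed known here
u = np.sqrt(h) * (0.5 * v2 + rng.normal(size=n)) # Var(u | z) proportional to h
y1 = 1.0 + 1.0 * y2 + 0.5 * z1 + u

X = np.column_stack([const, y2, z1])
Z = np.column_stack([const, z1, z2, z3])

b_unweighted = tsls(X, Z, y1)
b_weighted = weighted_tsls(X, Z, y1, h)
print("unweighted 2SLS coefficient on y2:", round(float(b_unweighted[1]), 3))
print("weighted 2SLS coefficient on y2:  ", round(float(b_weighted[1]), 3))
```

Both estimates are close to the true value of 1 in this design; the point of weighting is efficiency, not consistency.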


15-7 Applying 2SLS to Time Series Equations 


When we apply 2SLS to time series data, many of the considerations that arose for OLS in Chapters
10, 11, and 12 are relevant. Write the structural equation for each time period as


$y_t = \beta_0 + \beta_1 x_{t1} + \cdots + \beta_k x_{tk} + u_t$, [15.52]


where one or more of the explanatory variables $x_{tj}$ might be correlated with $u_t$. Denote the set of
exogenous variables by $z_{t1}, \ldots, z_{tm}$:


$E(u_t) = 0$, $\mathrm{Cov}(z_{tj}, u_t) = 0$, $j = 1, \ldots, m$.


Any exogenous explanatory variable is also a $z_{tj}$. For identification, it is necessary that $m \ge k$ (we have at
least as many exogenous variables as explanatory variables).


GOING FURTHER 15.4


A model to test the effect of growth in government spending on growth in output is


$gGDP_t = \beta_0 + \beta_1 gGOV_t + \beta_2 INVRAT_t + \beta_3 gLAB_t + u_t$,


where g indicates growth, GDP is real gross domestic product, GOV is real government
spending, INVRAT is the ratio of gross domestic investment to GDP, and LAB is the
size of the labor force. [See equation (6) in Ram (1986).] Under what assumptions
would a dummy variable indicating whether the president in year t - 1 is a Republican
be a suitable IV for $gGOV_t$?


The mechanics of 2SLS are identical for time series or cross-sectional data, but for time series
data the statistical properties of 2SLS depend on the trending and correlation properties of the
underlying sequences. In particular, we must be careful to include trends if we have trending dependent or
explanatory variables. Since a time trend is exogenous, it can always serve as its own instrumental
variable. The same is true of seasonal dummy variables, if monthly or quarterly data are used.

Series that have strong persistence (have unit roots) must be used with care, just as with OLS.
Often, differencing the equation is warranted before estimation, and this applies to the instruments as well.


Under analogs of the assumptions in Chapter 11 for the asymptotic properties of OLS, 2SLS 
using time series data is consistent and asymptotically normally distributed. In fact, if we replace the 
explanatory variables with the instrumental variables in stating the assumptions, we only need to add 
the identification assumptions for 2SLS. For example, the homoskedasticity assumption is stated as 


$E(u_t^2 | z_{t1}, \ldots, z_{tm}) = \sigma^2,$ [15.53]

and the no serial correlation assumption is stated as


$E(u_t u_s | \mathbf{z}_t, \mathbf{z}_s) = 0$ for all $t \neq s,$ [15.54]


where $\mathbf{z}_t$ denotes all exogenous variables at time $t$. A full statement of the assumptions is given in the
chapter appendix. We will provide examples of 2SLS for time series problems in Chapter 16; see also 
Computer Exercise C4. 

As in the case of OLS, the no serial correlation assumption can often be violated with time series 
data. Fortunately, it is very easy to test for AR(1) serial correlation. If we write $u_t = \rho u_{t-1} + e_t$ and
plug this into equation (15.52), we get


$y_t = \beta_0 + \beta_1 x_{t1} + \cdots + \beta_k x_{tk} + \rho u_{t-1} + e_t, \quad t \geq 2.$ [15.55]


To test $H_0: \rho = 0$, we must replace $u_{t-1}$ with the 2SLS residuals, $\hat{u}_{t-1}$. Further, if $x_{tj}$ is endogenous in
(15.52), then it is endogenous in (15.55), so we still need to use an IV. Because $e_t$ is uncorrelated with
all past values of $u_t$, $\hat{u}_{t-1}$ can be used as its own instrument.


Testing for AR(1) Serial Correlation after 2SLS: 
(i) Estimate (15.52) by 2SLS and obtain the 2SLS residuals, $\hat{u}_t$.
(ii) Estimate

$y_t = \beta_0 + \beta_1 x_{t1} + \cdots + \beta_k x_{tk} + \rho \hat{u}_{t-1} + error_t, \quad t = 2, \ldots, n$

by 2SLS, using the same instruments from part (i), in addition to $\hat{u}_{t-1}$. Use the $t$ statistic on $\hat{\rho}$ to test
$H_0: \rho = 0$.

As with the OLS version of this test from Chapter 12, the $t$ statistic only has asymptotic justification,
but it tends to work well in practice. A heteroskedasticity-robust version can be used to guard
against heteroskedasticity. Further, lagged residuals can be added to the equation to test for higher 
forms of serial correlation using a joint F test. 
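
The two-step test can be sketched as follows. This is a minimal illustration on simulated data, assuming a single endogenous regressor and one outside instrument; the function names are hypothetical, and the standard errors are the usual nonrobust 2SLS ones rather than a heteroskedasticity-robust version.

```python
import numpy as np

def tsls(y, X, Z):
    """Return 2SLS coefficients, usual (nonrobust) standard errors, and residuals."""
    X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]    # first-stage fitted values
    XtX_inv = np.linalg.inv(X_hat.T @ X_hat)
    b = XtX_inv @ X_hat.T @ y
    resid = y - X @ b                                   # built from X, not X_hat
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    return b, np.sqrt(sigma2 * np.diag(XtX_inv)), resid

def ar1_test_after_tsls(y, X, Z):
    """Steps (i)-(ii): add the lagged 2SLS residual as a regressor and as its own IV."""
    _, _, u_hat = tsls(y, X, Z)
    lag = u_hat[:-1]                                    # u_hat_{t-1} for t = 2, ..., n
    X_aug = np.column_stack([X[1:], lag])
    Z_aug = np.column_stack([Z[1:], lag])
    b, se, _ = tsls(y[1:], X_aug, Z_aug)
    return b[-1], b[-1] / se[-1]                        # rho_hat and its t statistic

# Simulated series with AR(1) errors, rho = 0.5 (illustration only)
rng = np.random.default_rng(1)
n = 2000
z, v = rng.normal(size=n), rng.normal(size=n)
e = 0.5 * v + rng.normal(size=n)
u = np.zeros(n)
for t in range(1, n):
    u[t] = 0.5 * u[t - 1] + e[t]
x = 1.0 + z + v                                         # endogenous through v
y = 1.0 + 2.0 * x + u
X = np.column_stack([np.ones(n), x])
Z = np.column_stack([np.ones(n), z])
rho_hat, t_stat = ar1_test_after_tsls(y, X, Z)
print(rho_hat, t_stat)
```

Higher-order tests would add further lags of the residuals to both `X_aug` and `Z_aug` and use a joint test instead of a single t statistic.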

What happens if we detect serial correlation? Some econometrics packages will compute standard 
errors that are robust to fairly general forms of serial correlation and heteroskedasticity. This is a nice, 
simple way to go if your econometrics package does this. The computations are very similar to those 
in Section 12-5 for OLS. [See Wooldridge (1995) for formulas and other computational methods.]

An alternative is to use the AR(1) model and correct for serial correlation. The procedure is 
similar to that for OLS and places additional restrictions on the instrumental variables. The quasi- 
differenced equation is the same as in equation (12.32): 


$\tilde{y}_t = \beta_0(1 - \rho) + \beta_1 \tilde{x}_{t1} + \cdots + \beta_k \tilde{x}_{tk} + e_t, \quad t \geq 2,$ [15.56]


where $\tilde{x}_{tj} = x_{tj} - \rho x_{t-1,j}$. (We can use the $t = 1$ observation just as in Section 12-3, but we omit that for
simplicity here.) The question is: What can we use as instrumental variables? It seems natural to use the
quasi-differenced instruments, $\tilde{z}_{tj} = z_{tj} - \rho z_{t-1,j}$. This only works, however, if in (15.52) the original
error $u_t$ is uncorrelated with the instruments at times $t$, $t - 1$, and $t + 1$. That is, the instrumental variables
must be strictly exogenous in (15.52). This rules out lagged dependent variables as IVs, for example. It
also eliminates cases where future movements in the IVs react to current and past changes in the error, $u_t$.


2SLS with AR(1) Errors: 
(i) Estimate (15.52) by 2SLS and obtain the 2SLS residuals, $\hat{u}_t$, $t = 1, 2, \ldots, n$.
(ii) Obtain $\hat{\rho}$ from the regression of $\hat{u}_t$ on $\hat{u}_{t-1}$, $t = 2, \ldots, n$, and construct the quasi-differenced
variables $\tilde{y}_t = y_t - \hat{\rho} y_{t-1}$, $\tilde{x}_{tj} = x_{tj} - \hat{\rho} x_{t-1,j}$, and $\tilde{z}_{tj} = z_{tj} - \hat{\rho} z_{t-1,j}$ for $t \geq 2$. (Remember, in most
cases, some of the IVs will also be explanatory variables.)
(iii) Estimate (15.56) (where $\rho$ is replaced with $\hat{\rho}$) by 2SLS, using the $\tilde{z}_{tj}$ as the instruments.
Assuming that (15.56) satisfies the 2SLS assumptions in the chapter appendix, the usual 2SLS 
test statistics are asymptotically valid. 
We can also use the first time period as in Prais-Winsten estimation of the model with exogenous
explanatory variables. The transformed variables in the first time period (the dependent variable,
explanatory variables, and instrumental variables) are obtained simply by multiplying all first-period
values by $(1 - \hat{\rho}^2)^{1/2}$. (See also Section 12-3.)
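
The three steps of 2SLS with AR(1) errors can be sketched as below. This is an illustrative sketch under stated assumptions: the data are simulated with a strictly exogenous instrument, the function names are hypothetical, the regression for $\hat{\rho}$ is run without an intercept, and the first observation is dropped rather than given the Prais-Winsten transformation.

```python
import numpy as np

def tsls(y, X, Z):
    """2SLS coefficients: regress y on the first-stage fitted values of X."""
    X_hat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]
    return np.linalg.lstsq(X_hat, y, rcond=None)[0]

def tsls_ar1(y, X, Z):
    """2SLS on quasi-differenced data, using t = 2, ..., n."""
    b_init = tsls(y, X, Z)                               # step (i)
    u_hat = y - X @ b_init
    rho_hat = (u_hat[1:] @ u_hat[:-1]) / (u_hat[:-1] @ u_hat[:-1])  # step (ii)

    def quasi_diff(a):
        return a[1:] - rho_hat * a[:-1]

    # The column of ones becomes (1 - rho_hat), so its coefficient is still beta_0.
    b = tsls(quasi_diff(y), quasi_diff(X), quasi_diff(Z))           # step (iii)
    return b, rho_hat

# Simulated illustration: z is i.i.d., hence strictly exogenous
rng = np.random.default_rng(4)
n = 5000
z, v = rng.normal(size=n), rng.normal(size=n)
e = 0.5 * v + rng.normal(size=n)
u = np.zeros(n)
for t in range(1, n):
    u[t] = 0.5 * u[t - 1] + e[t]
x = 1.0 + z + v
y = 1.0 + 2.0 * x + u
X = np.column_stack([np.ones(n), x])
Z = np.column_stack([np.ones(n), z])
b_ar1, rho_hat = tsls_ar1(y, X, Z)
print(b_ar1, rho_hat)
```

If an instrument were a lagged dependent variable, the quasi-differenced instrument would be correlated with the transformed error, so this sketch would not apply.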


15-8 Applying 2SLS to Pooled Cross Sections and Panel Data 


Applying instrumental variables methods to independently pooled cross sections raises no new dif- 
ficulties. As with models estimated by OLS, we should often include time period dummy variables to 
allow for aggregate time effects. These dummy variables are exogenous—because the passage of time 
is exogenous—and so they act as their own instruments. 


Effect of Education on Fertility 


In Example 13.1, we used the pooled cross section in FERTIL1 to estimate the effect of education on 
women’s fertility, controlling for various other factors. As in Sander (1992), we allow for the possibil- 
ity that educ is endogenous in the equation. As instrumental variables for educ, we use mother’s and 
father’s education levels (meduc, feduc). The 2SLS estimate of $\beta_{educ}$ is $-.153$ (se = .039), compared
with the OLS estimate $-.128$ (se = .018). The 2SLS estimate shows a somewhat larger effect of
education on fertility, but the 2SLS standard error is over twice as large as the OLS standard error.
(In fact, the 95% confidence interval based on 2SLS easily contains the OLS estimate.) The OLS
and 2SLS estimates of $\beta_{educ}$ are not statistically different, as can be seen by testing for endogeneity
of educ as in Section 15-5: when the reduced form residual, $\hat{v}_2$, is included with the other regressors
in Table 13.1 (including educ), its $t$ statistic is .702, which is not significant at any reasonable level.
Therefore, in this case, we conclude that the difference between 2SLS and OLS could be entirely due 
to sampling error. 


Instrumental variables estimation can be combined with panel data methods, particularly first 
differencing, to estimate parameters consistently in the presence of unobserved effects and endogene- 
ity in one or more time-varying explanatory variables. The following simple example illustrates this 
combination of methods. 


Job Training and Worker Productivity 


Suppose we want to estimate the effect of another hour of job training on worker productivity. For the 
two years 1987 and 1988, consider the simple panel data model 


$\log(scrap_{it}) = \beta_0 + \delta_0 d88_t + \beta_1 hrsemp_{it} + a_i + u_{it}, \quad t = 1, 2,$


where $scrap_{it}$ is firm $i$’s scrap rate in year $t$ and $hrsemp_{it}$ is hours of job training per employee. As
usual, we allow different year intercepts and a constant, unobserved firm effect, $a_i$.

For the reasons discussed in Section 13-2, we might be concerned that $hrsemp_{it}$ is correlated with $a_i$,
the latter of which contains unmeasured worker ability. As before, we difference to remove $a_i$:


$\Delta\log(scrap_i) = \delta_0 + \beta_1 \Delta hrsemp_i + \Delta u_i.$ [15.57]


Normally, we would estimate this equation by OLS. But what if $\Delta u_i$ is correlated with $\Delta hrsemp_i$? For
example, a firm might hire more skilled workers, while at the same time reducing the level of job training.
In this case, we need an instrumental variable for $\Delta hrsemp_i$. Generally, such an IV would be hard to
find, but we can exploit the fact that some firms received job training grants in 1988. If we assume that
grant designation is uncorrelated with $\Delta u_i$ (something that is reasonable, because the grants were given
at the beginning of 1988), then $\Delta grant_i$ is valid as an IV, provided $\Delta hrsemp$ and $\Delta grant$ are correlated.
Using the data in JTRAIN differenced between 1987 and 1988, the first stage regression is

$\widehat{\Delta hrsemp} = .51 + 27.88\,\Delta grant$
(1.56) (3.13)

$n = 45, R^2 = .392.$

This confirms that the change in hours of job training per employee is strongly positively related to
receiving a job training grant in 1988. In fact, receiving a job training grant increased per-employee
training by almost 28 hours, and grant designation accounted for almost 40% of the variation in
$\Delta hrsemp$. Two stage least squares estimation of (15.57) gives

$\widehat{\Delta\log(scrap)} = -.033 - .014\,\Delta hrsemp$
(.127) (.008)

$n = 45, R^2 = .016.$


This means that 10 more hours of job training per worker are estimated to reduce the scrap rate 
by about 14%. For the firms in the sample, the average amount of job training in 1988 was about 
17 hours per worker, with a minimum of zero and a maximum of 88. 

For comparison, OLS estimation of (15.57) gives $\hat{\beta}_1 = -.0076$ (se = .0045), so the 2SLS estimate
of $\beta_1$ is almost twice as large in magnitude and is slightly more statistically significant.
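
The combination of first differencing and IV can be illustrated on simulated data. The sketch below does not use JTRAIN; every number in the data-generating process is an assumption chosen for illustration, including a hypothetical unobserved "skilled hiring" shock that lowers both training and the scrap rate. With a single regressor and a single instrument, the IV slope is the ratio of two sample covariances.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5000                                            # hypothetical number of firms
beta1 = -0.014                                      # assumed effect of one more training hour

a = rng.normal(size=(n, 1))                         # unobserved firm effect a_i
grant = np.column_stack([np.zeros(n), rng.integers(0, 2, size=n)])  # grants only in year 2
skill = rng.normal(size=(n, 2))                     # unobserved skilled hiring
hrsemp = 20 + 2 * a + 25 * grant - 3 * skill + rng.normal(size=(n, 2))
u = -0.1 * skill + 0.1 * rng.normal(size=(n, 2))    # skilled hiring also lowers scrap
d88 = np.array([0.0, 1.0])                          # year dummy
log_scrap = 0.5 - 0.03 * d88 + beta1 * hrsemp + a + u

# Difference across the two years to remove a_i
d_scrap = log_scrap[:, 1] - log_scrap[:, 0]
d_hrsemp = hrsemp[:, 1] - hrsemp[:, 0]
d_grant = grant[:, 1] - grant[:, 0]

def cov(p, q):
    """Sample covariance between two one-dimensional arrays."""
    return np.cov(p, q)[0, 1]

b_ols = cov(d_hrsemp, d_scrap) / cov(d_hrsemp, d_hrsemp)   # OLS on the differenced equation
b_iv = cov(d_grant, d_scrap) / cov(d_grant, d_hrsemp)      # IV using the change in grant
print(b_ols, b_iv)
```

In this setup the differenced error and the change in training are positively correlated, so OLS on the differenced equation is pulled toward zero while the IV slope stays close to the assumed effect.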


When $T \geq 3$, the differenced equation may contain serial correlation. The same test and correction
for AR(1) serial correlation from Section 15-7 can be used, where all regressions are pooled
across i as well as t. Because we do not want to lose an entire time period, the Prais-Winsten transfor- 
mation should be used for the initial time period. 

Unobserved effects models containing lagged dependent variables also require IV methods for 
consistent estimation. The reason is that, after differencing, $\Delta y_{i,t-1}$ is correlated with $\Delta u_{it}$ because
$y_{i,t-1}$ and $u_{i,t-1}$ are correlated. We can use two or more lags of $y$ as IVs for $\Delta y_{i,t-1}$. [See Wooldridge
(2010, Chapter 11) for details.] 

Instrumental variables after differencing can be used on matched pairs samples as well. Ashenfelter 
and Krueger (1994) differenced the wage equation across twins to eliminate unobserved ability: 


$\log(wage_2) - \log(wage_1) = \delta_0 + \beta_1(educ_{2,2} - educ_{1,1}) + (u_2 - u_1),$


where $educ_{1,1}$ is years of schooling for the first twin as reported by the first twin and $educ_{2,2}$ is years
of schooling for the second twin as reported by the second twin. To account for possible measurement
error in the self-reported schooling measures, Ashenfelter and Krueger used $(educ_{2,1} - educ_{1,2})$
as an IV for $(educ_{2,2} - educ_{1,1})$, where $educ_{2,1}$ is years of schooling for the second twin as reported
by the first twin and $educ_{1,2}$ is years of schooling for the first twin as reported by the second twin.
The IV estimate of $\beta_1$ is .167 ($t = 3.88$), compared with the OLS estimate on the first differences of
.092 ($t = 3.83$) [see Ashenfelter and Krueger (1994, Table 3)].


Summary 


In Chapter 15, we have introduced the method of instrumental variables as a way to estimate the param- 
eters in a linear model consistently when one or more explanatory variables are endogenous. An instrumen- 
tal variable must have two properties: (1) it must be exogenous, that is, uncorrelated with the error term of 
the structural equation; (2) it must be partially correlated with the endogenous explanatory variable. Find- 
ing a variable with these two properties is usually challenging. 
The method of two stage least squares, which allows for more instrumental variables than we have
explanatory variables, is used routinely in the empirical social sciences. When used properly, it can allow 
us to estimate ceteris paribus effects in the presence of endogenous explanatory variables. This is true in 
cross-sectional, time series, and panel data applications. But when instruments are poor—which means 
they are correlated with the error term, only weakly correlated with the endogenous explanatory variable, 
or both—then 2SLS can be worse than OLS. 

When we have valid instrumental variables, we can test whether an explanatory variable is endog- 
enous, using the test in Section 15-5. In addition, though we can never test whether all IVs are exogenous, 
we can test that at least some of them are—assuming that we have more instruments than we need for 
consistent estimation (that is, the model is overidentified). Heteroskedasticity and serial correlation can be 
tested for and dealt with using methods similar to the case of models with exogenous explanatory variables. 

In this chapter, we used omitted variables and measurement error to illustrate the method of instru- 
mental variables. IV methods are also indispensable for simultaneous equations models, which we will 
cover in Chapter 16. 


Key Terms 


Endogenous Explanatory Variables
Errors-in-Variables
Exclusion Restrictions
Exogenous Explanatory Variables
Exogenous Variables
First Stage
Identification
Instrument
Instrumental Variable
Instrumental Variables (IV) Estimator
Instrument Exogeneity
Instrument Relevance
Natural Experiment
Omitted Variables
Order Condition
Overidentifying Restrictions
Rank Condition
Reduced Form Equation
Structural Equation
Two Stage Least Squares (2SLS) Estimator
Weak Instruments


Problems 


1 Consider a simple model to estimate the effect of personal computer (PC) ownership on college grade 
point average for graduating seniors at a large public university: 


$GPA = \beta_0 + \beta_1 PC + u,$


where PC is a binary variable indicating PC ownership.

(i) Why might PC ownership be correlated with $u$?

(ii) Explain why PC is likely to be related to parents’ annual income. Does this mean parental
income is a good IV for PC? Why or why not?

(iii) Suppose that, four years ago, the university gave grants to buy computers to roughly one- 
half of the incoming students, and the students who received grants were randomly chosen. 
Carefully explain how you would use this information to construct an instrumental variable 
for PC. 


2 Suppose that you wish to estimate the effect of class attendance on student performance, as in 
Example 6.3. A basic model is 


$stndfnl = \beta_0 + \beta_1 atndrte + \beta_2 priGPA + \beta_3 ACT + u,$


where the variables are defined as in Chapter 6.

(i) Let dist be the distance from the students’ living quarters to the lecture hall. Do you think dist is
uncorrelated with $u$?

(ii) Assuming that dist and $u$ are uncorrelated, what other assumption must dist satisfy to be a valid
IV for atndrte?

(iii) Suppose, as in equation (6.18), we add the interaction term $priGPA \cdot atndrte$:

$stndfnl = \beta_0 + \beta_1 atndrte + \beta_2 priGPA + \beta_3 ACT + \beta_4 priGPA \cdot atndrte + u.$

If atndrte is correlated with $u$, then, in general, so is $priGPA \cdot atndrte$. What might be a good IV for
$priGPA \cdot atndrte$? [Hint: If $E(u|priGPA, ACT, dist) = 0$, as happens when priGPA, ACT, and dist are all
exogenous, then any function of priGPA and dist is uncorrelated with $u$.]


3 Consider the simple regression model


$y = \beta_0 + \beta_1 x + u$

and let $z$ be a binary instrumental variable for $x$. Use (15.10) to show that the IV estimator $\hat{\beta}_1$ can be
written as


$\hat{\beta}_1 = (\bar{y}_1 - \bar{y}_0)/(\bar{x}_1 - \bar{x}_0),$

where $\bar{y}_0$ and $\bar{x}_0$ are the sample averages of $y_i$ and $x_i$ over the part of the sample with $z_i = 0$, and where
$\bar{y}_1$ and $\bar{x}_1$ are the sample averages of $y_i$ and $x_i$ over the part of the sample with $z_i = 1$. This estimator,
known as a grouping estimator, was first suggested by Wald (1940).
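
Before working through the algebra, the equality can be checked numerically. The sketch below uses simulated data with an assumed data-generating process; it compares the IV slope written as a ratio of sample covariances with the ratio of differences in group means. It is a numerical check only and does not replace the derivation the problem asks for.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1000
z = rng.integers(0, 2, size=n)                  # binary instrument
v = rng.normal(size=n)
x = 0.5 + 1.5 * z + v                           # x depends on z and on v
y = 1.0 + 2.0 * x + (0.8 * v + rng.normal(size=n))   # error shares v with x

# IV slope: sample covariance of (z, y) over sample covariance of (z, x)
b_iv = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]

# Grouping estimator: ratio of differences in group means
b_group = (y[z == 1].mean() - y[z == 0].mean()) / (x[z == 1].mean() - x[z == 0].mean())

print(b_iv, b_group)                            # the two numbers agree up to rounding error
```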


4 Suppose that, for a given state in the United States, you wish to use annual time series data to estimate
the effect of the state-level minimum wage on the employment of those 18 to 25 years old (EMP). A
simple model is


$gEMP_t = \beta_0 + \beta_1 gMIN_t + \beta_2 gPOP_t + \beta_3 gGSP_t + \beta_4 gGDP_t + u_t,$


where $MIN_t$ is the minimum wage, in real dollars; $POP_t$ is the population from 18 to 25 years old; $GSP_t$
is gross state product; and $GDP_t$ is U.S. gross domestic product. The g prefix indicates the growth rate
from year $t - 1$ to year $t$, which would typically be approximated by the difference in the logs.

(i) If we are worried that the state chooses its minimum wage partly based on unobserved (to us)
factors that affect youth employment, what is the problem with OLS estimation?

(ii) Let $USMIN_t$ be the U.S. minimum wage, which is also measured in real terms. Do you think
$gUSMIN_t$ is uncorrelated with $u_t$?

(iii) By law, any state’s minimum wage must be at least as large as the U.S. minimum. Explain why
this makes $gUSMIN_t$ a potential IV candidate for $gMIN_t$.


5 Refer to equations (15.19) and (15.20). Assume that $\sigma_u = \sigma_x$, so that the population variation in the
error term is the same as it is in $x$. Suppose that the instrumental variable, $z$, is slightly correlated with
$u$: $\mathrm{Corr}(z, u) = .1$. Suppose also that $z$ and $x$ have a somewhat stronger correlation: $\mathrm{Corr}(z, x) = .2$.

(i) What is the asymptotic bias in the IV estimator?

(ii) How much correlation would have to exist between $x$ and $u$ before OLS has more asymptotic
bias than 2SLS?


6 (i) In the model with one endogenous explanatory variable, one exogenous explanatory variable,
and one extra exogenous variable, take the reduced form for $y_2$ (15.26), and plug it into the
structural equation (15.22). This gives the reduced form for $y_1$:


$y_1 = \alpha_0 + \alpha_1 z_1 + \alpha_2 z_2 + v_1.$


Find the $\alpha_j$ in terms of the $\beta_j$ and the $\pi_j$.
(ii) Find the reduced form error, $v_1$, in terms of $u_1$, $v_2$, and the parameters.
(iii) How would you consistently estimate the $\alpha_j$?


7 The following is a simple model to measure the effect of a school choice program on standardized test
performance [see Rouse (1998) for motivation and Computer Exercise C11 for an analysis of a subset
of Rouse’s data]:


$score = \beta_0 + \beta_1 choice + \beta_2 faminc + u_1,$

where score is the score on a statewide test, choice is a binary variable indicating whether a student
attended a choice school in the last year, and faminc is family income. The IV for choice is grant, the
dollar amount granted to students to use for tuition at choice schools. The grant amount differed by
family income level, which is why we control for faminc in the equation.

(i) Even with faminc in the equation, why might choice be correlated with $u_1$?

(ii) If within each income class, the grant amounts were assigned randomly, is grant uncorrelated with $u_1$?

(iii) Write the reduced form equation for choice. What is needed for grant to be partially correlated
with choice?

(iv) Write the reduced form equation for score. Explain why this is useful. (Hint: How do you
interpret the coefficient on grant?)


8 Suppose you want to test whether girls who attend a girls’ high school do better in math than girls who 
attend coed schools. You have a random sample of senior high school girls from a state in the United 
States, and score is the score on a standardized math test. Let girlhs be a dummy variable indicating 
whether a student attends a girls’ high school. 

(i) What other factors would you control for in the equation? (You should be able to reasonably
collect data on these factors.)

(ii) Write an equation relating score to girlhs and the other factors you listed in part (i).

(iii) Suppose that parental support and motivation are unmeasured factors in the error term in
part (ii). Are these likely to be correlated with girlhs? Explain.

(iv) Discuss the assumptions needed for the number of girls’ high schools within a 20-mile radius of
a girl’s home to be a valid IV for girlhs.

(v) Suppose that, when you estimate the reduced form for girlhs, you find that the coefficient on
numghs (the number of girls’ high schools within a 20-mile radius) is negative and statistically
significant. Would you feel comfortable proceeding with IV estimation where numghs is used as
an IV for girlhs? Explain.


9 Suppose that, in equation (15.8), you do not have a good instrumental variable candidate for skipped. 
But you have two other pieces of information on students: combined SAT score and cumulative GPA 
prior to the semester. What would you do instead of IV estimation? 


10 In a recent article, Evans and Schwab (1995) studied the effects of attending a Catholic high school on
the probability of attending college. For concreteness, let college be a binary variable equal to unity if a 
student attends college, and zero otherwise. Let CathHS be a binary variable equal to one if the student 
attends a Catholic high school. A linear probability model is 


$college = \beta_0 + \beta_1 CathHS + \text{other factors} + u,$


where the other factors include gender, race, family income, and parental education. 

(i) Why might CathHS be correlated with u? 

(ii) Evans and Schwab have data on a standardized test score taken when each student was a sopho-
more. What can be done with this variable to improve the ceteris paribus estimate of attending a 
Catholic high school? 

(iii) Let CathRel be a binary variable equal to one if the student is Catholic. Discuss the two require- 
ments needed for this to be a valid IV for CathHS in the preceding equation. Which of these can 
be tested? 

(iv) Not surprisingly, being Catholic has a significant positive effect on attending a Catholic high 
school. Do you think CathRel is a convincing instrument for CathHS? 


11 Consider a simple time series model where the explanatory variable has classical measurement error: 


$y_t = \beta_0 + \beta_1 x_t^* + u_t$ [15.58]


$x_t = x_t^* + e_t,$


where $u_t$ has zero mean and is uncorrelated with $x_t^*$ and $e_t$. We observe $y_t$ and $x_t$ only. Assume that $e_t$ has
zero mean and is uncorrelated with $x_t^*$ and that $x_t^*$ also has a zero mean (this last assumption is only to
simplify the algebra).

(i) Write $x_t^* = x_t - e_t$ and plug this into (15.58). Show that the error term in the new equation, say,
$v_t$, is negatively correlated with $x_t$ if $\beta_1 > 0$. What does this imply about the OLS estimator of
$\beta_1$ from the regression of $y_t$ on $x_t$?

(ii) In addition to the previous assumptions, assume that $u_t$ and $e_t$ are uncorrelated with all past
values of $x_t^*$ and $e_t$; in particular, with $x_{t-1}^*$ and $e_{t-1}$. Show that $E(x_{t-1} v_t) = 0$ where $v_t$ is the
error term in the model from part (i).

(iii) Are $x_t$ and $x_{t-1}$ likely to be correlated? Explain.

(iv) What do parts (ii) and (iii) suggest as a useful strategy for consistently estimating $\beta_0$ and $\beta_1$?


Computer Exercises 


C1 Use the data in WAGE2 for this exercise. 


(i) In Example 15.2, if sibs is used as an instrument for educ, the IV estimate of the return to
education is .122. To convince yourself that using sibs as an IV for educ is not the same as just
plugging sibs in for educ and running an OLS regression, run the regression of log(wage) on
sibs and explain your findings.

(ii) The variable brthord is birth order (brthord is one for a first-born child, two for a second-born
child, and so on). Explain why educ and brthord might be negatively correlated. Regress educ
on brthord to determine whether there is a statistically significant negative correlation.

(iii) Use brthord as an IV for educ in equation (15.1). Report and interpret the results.

(iv) Now, suppose that we include number of siblings as an explanatory variable in the wage
equation; this controls for family background, to some extent:


$\log(wage) = \beta_0 + \beta_1 educ + \beta_2 sibs + u.$


Suppose that we want to use brthord as an IV for educ, assuming that sibs is exogenous. The
reduced form for educ is


$educ = \pi_0 + \pi_1 sibs + \pi_2 brthord + v.$


State and test the identification assumption.

(v) Estimate the equation from part (iv) using brthord as an IV for educ (and sibs as its own IV).
Comment on the standard errors for $\hat{\beta}_{educ}$ and $\hat{\beta}_{sibs}$.

(vi) Using the fitted values from part (iv), $\widehat{educ}$, compute the correlation between $\widehat{educ}$ and sibs. Use
this result to explain your findings from part (v).


C2 The data in FERTIL2 include, for women in Botswana during 1988, information on number of
children, years of education, age, and religious and economic status variables.

(i) Estimate the model


$children = \beta_0 + \beta_1 educ + \beta_2 age + \beta_3 age^2 + u$


by OLS and interpret the estimates. In particular, holding age fixed, what is the estimated effect
of another year of education on fertility? If 100 women receive another year of education, how
many fewer children are they expected to have?

(ii) The variable frsthalf is a dummy variable equal to one if the woman was born during the
first six months of the year. Assuming that frsthalf is uncorrelated with the error term from
part (i), show that frsthalf is a reasonable IV candidate for educ. (Hint: You need to do a
regression.)

(iii) Estimate the model from part (i) by using frsthalf as an IV for educ. Compare the estimated
effect of education with the OLS estimate from part (i).

(iv) Add the binary variables electric, tv, and bicycle to the model and assume these are exogenous.
Estimate the equation by OLS and 2SLS and compare the estimated coefficients on educ. Interpret
the coefficient on tv and explain why television ownership has a negative effect on fertility.


C3 Use the data in CARD for this exercise.
(i) The equation we estimated in Example 15.4 can be written as


$\log(wage) = \beta_0 + \beta_1 educ + \beta_2 exper + \cdots + u,$


where the other explanatory variables are listed in Table 15.1. In order for IV to be consistent, 
the IV for educ, nearc4, must be uncorrelated with u. Could nearc4 be correlated with things in 
the error term, such as unobserved ability? Explain. 

(ii) For a subsample of the men in the data set, an IQ score is available. Regress IQ on nearc4 to
check whether average IQ scores vary by whether the man grew up near a four-year college.
What do you conclude?

(iii) Now, regress IQ on nearc4, smsa66, and the 1966 regional dummy variables reg662, ...,
reg669. Are IQ and nearc4 related after the geographic dummy variables have been partialled
out? Reconcile this with your findings from part (ii).

(iv) From parts (ii) and (iii), what do you conclude about the importance of controlling for smsa66 
and the 1966 regional dummies in the log(wage) equation? 


C4 Use the data in INTDEF for this exercise. A simple equation relating the three-month T-bill rate to the
inflation rate (constructed from the Consumer Price Index) is


$i3_t = \beta_0 + \beta_1 inf_t + u_t.$


(i) Estimate this equation by OLS, omitting the first time period for later comparisons. Report the
results in the usual form.

(ii) Some economists feel that the Consumer Price Index mismeasures the true rate of inflation, so that
the OLS from part (i) suffers from measurement error bias. Reestimate the equation from part (i),
using $inf_{t-1}$ as an IV for $inf_t$. How does the IV estimate of $\beta_1$ compare with the OLS estimate?

(iii) Now, first difference the equation:


$\Delta i3_t = \beta_0 + \beta_1 \Delta inf_t + \Delta u_t.$


Estimate this by OLS and compare the estimate of $\beta_1$ with the previous estimates.
(iv) Can you use $\Delta inf_{t-1}$ as an IV for $\Delta inf_t$ in the differenced equation in part (iii)? Explain.
(Hint: Are $\Delta inf_t$ and $\Delta inf_{t-1}$ sufficiently correlated?)


C5 Use the data in CARD for this exercise.

(i) In Table 15.1, the difference between the IV and OLS estimates of the return to education
is economically important. Obtain the reduced form residuals, $\hat{v}_2$, from the reduced form
regression educ on nearc4, exper, $exper^2$, black, smsa, south, smsa66, reg662, ..., reg669 (see
Table 15.1). Use these to test whether educ is exogenous; that is, determine if the difference
between OLS and IV is statistically significant.

(ii) Estimate the equation by 2SLS, adding nearc2 as an instrument. Does the coefficient on educ 
change much? 

(iii) Test the single overidentifying restriction from part (ii). 


C6 Use the data in MURDER for this exercise. The variable mrdrte is the murder rate, that is, the number
of murders per 100,000 people. The variable exec is the total number of prisoners executed for the
current and prior two years; unem is the state unemployment rate.

(i) How many states executed at least one prisoner in 1991, 1992, or 1993? Which state had the
most executions?

(ii) Using the two years 1990 and 1993, do a pooled regression of mrdrte on d93, exec, and unem.
What do you make of the coefficient on exec?

(iii) Using the changes from 1990 to 1993 only (for a total of 51 observations), estimate the equation

$\Delta mrdrte = \delta_0 + \beta_1 \Delta exec + \beta_2 \Delta unem + \Delta u$

by OLS and report the results in the usual form. Now, does capital punishment appear to have a
deterrent effect?

(iv) The change in executions may be at least partly related to changes in the expected murder rate,
so that $\Delta exec$ is correlated with $\Delta u$ in part (iii). It might be reasonable to assume that $\Delta exec_{-1}$
is uncorrelated with $\Delta u$. (After all, $\Delta exec_{-1}$ depends on executions that occurred three or more
years ago.) Regress $\Delta exec$ on $\Delta exec_{-1}$ to see if they are sufficiently correlated; interpret the
coefficient on $\Delta exec_{-1}$.

(v) Reestimate the equation from part (iii), using $\Delta exec_{-1}$ as an IV for $\Delta exec$. Assume that $\Delta unem$
is exogenous. How do your conclusions change from part (iii)?


C7 Use the data in PHILLIPS for this exercise.
(i) In Example 11.5, we estimated an expectations augmented Phillips curve of the form


$\Delta inf_t = \beta_0 + \beta_1 unem_t + e_t,$


where $\Delta inf_t = inf_t - inf_{t-1}$. In estimating this equation by OLS, we assumed that the supply shock,
$e_t$, was uncorrelated with $unem_t$. If this is false, what can be said about the OLS estimator of $\beta_1$?

(ii) Suppose that $e_t$ is unpredictable given all past information: $E(e_t | inf_{t-1}, unem_{t-1}, \ldots) = 0$.
Explain why this makes $unem_{t-1}$ a good IV candidate for $unem_t$.

(iii) Regress $unem_t$ on $unem_{t-1}$. Are $unem_t$ and $unem_{t-1}$ significantly correlated?

(iv) Estimate the expectations augmented Phillips curve by IV. Report the results in the usual form
and compare them with the OLS estimates from Example 11.5.


C8 Use the data in 401KSUBS for this exercise. The equation of interest is a linear probability model:

$pira = \beta_0 + \beta_1 p401k + \beta_2 inc + \beta_3 inc^2 + \beta_4 age + \beta_5 age^2 + u.$


The goal is to test whether there is a tradeoff between participating in a 401(k) plan and having
an individual retirement account (IRA). Therefore, we want to estimate $\beta_1$.

(i) Estimate the equation by OLS and discuss the estimated effect of p401k.

(ii) For the purposes of estimating the ceteris paribus tradeoff between participation in two different
types of retirement savings plans, what might be a problem with ordinary least squares?

(iii) The variable e401k is a binary variable equal to one if a worker is eligible to participate
in a 401(k) plan. Explain what is required for e401k to be a valid IV for p401k. Do these
assumptions seem reasonable?

(iv) Estimate the reduced form for p401k and verify that e401k has significant partial correlation
with p401k. Since the reduced form is also a linear probability model, use a heteroskedasticity-
robust standard error.

(v) Now, estimate the structural equation by IV and compare the estimate of $\beta_1$ with the OLS
estimate. Again, you should obtain heteroskedasticity-robust standard errors.

(vi) Test the null hypothesis that p401k is in fact exogenous, using a heteroskedasticity-robust test.


The purpose of this exercise is to compare the estimates and standard errors obtained by correctly 
using 2SLS with those obtained using inappropriate procedures. Use the data file WAGE2. 
(i) Use a 2SLS routine to estimate the equation


$$\log(\mathit{wage}) = \beta_0 + \beta_1 \mathit{educ} + \beta_2 \mathit{exper} + \beta_3 \mathit{tenure} + \beta_4 \mathit{black} + u,$$


where sibs is the IV for educ. Report the results in the usual form.

(ii) Now, manually carry out 2SLS. That is, first regress $\mathit{educ}_i$ on $\mathit{sibs}_i$, $\mathit{exper}_i$, $\mathit{tenure}_i$, and $\mathit{black}_i$
and obtain the fitted values, $\widehat{\mathit{educ}}_i$, $i = 1, \ldots, n$. Then, run the second stage regression of $\log(\mathit{wage}_i)$
on $\widehat{\mathit{educ}}_i$, $\mathit{exper}_i$, $\mathit{tenure}_i$, and $\mathit{black}_i$, $i = 1, \ldots, n$. Verify that the $\hat{\beta}_j$ are identical to those obtained
from part (i), but that the standard errors are somewhat different. The standard errors obtained
from the second stage regression when manually carrying out 2SLS are generally inappropriate.

(iii) Now, use the following two-step procedure, which generally yields inconsistent parameter
estimates of the $\beta_j$, and not just inconsistent standard errors. In step one, regress $\mathit{educ}_i$ on $\mathit{sibs}_i$
only and obtain the fitted values, say $\widetilde{\mathit{educ}}_i$. (Note that this is an incorrect first stage regression.)
Then, in the second step, run the regression of $\log(\mathit{wage}_i)$ on $\widetilde{\mathit{educ}}_i$, $\mathit{exper}_i$, $\mathit{tenure}_i$, and
$\mathit{black}_i$, $i = 1, \ldots, n$. How does the estimate from this incorrect, two-step procedure compare
with the correct 2SLS estimate of the return to education?
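
The contrast among the three procedures in this exercise can be previewed on simulated data. What follows is a minimal sketch, not a solution based on WAGE2: the data-generating process, the coefficient values, and the helper names `ols` and `tsls` are all assumptions made for illustration, and only one exogenous control (exper) is included to keep the example short.

```python
import numpy as np

def ols(X, y):
    """OLS coefficients and the usual (homoskedasticity-based) standard errors."""
    coef = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ coef
    sigma2 = resid @ resid / (X.shape[0] - X.shape[1])
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
    return coef, se

def tsls(X, Z, y):
    """2SLS with regressors X and instrument matrix Z (both include a constant)."""
    Xhat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]   # first-stage fitted values
    coef = np.linalg.lstsq(Xhat, y, rcond=None)[0]
    resid = y - X @ coef                               # residuals use the original X
    sigma2 = resid @ resid / (X.shape[0] - X.shape[1])
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(Xhat.T @ Xhat)))
    return coef, se

# Hypothetical data-generating process (not the WAGE2 data).
rng = np.random.default_rng(0)
n = 200_000
exper = rng.normal(size=n)
sibs = 0.5 * exper + rng.normal(size=n)        # instrument, correlated with exper
ability = rng.normal(size=n)                   # unobserved, makes educ endogenous
educ = 1.0 - 0.5 * sibs + 0.3 * exper + 0.8 * ability + rng.normal(size=n)
lwage = 1.0 + 0.1 * educ + 0.05 * exper + ability + rng.normal(size=n)

const = np.ones(n)
X = np.column_stack([const, educ, exper])
Z = np.column_stack([const, sibs, exper])

# (i) Proper 2SLS.
b_2sls, se_2sls = tsls(X, Z, lwage)

# (ii) Manual 2SLS: correct first stage, then OLS on the fitted values.
educ_hat = Z @ np.linalg.lstsq(Z, educ, rcond=None)[0]
b_manual, se_manual = ols(np.column_stack([const, educ_hat, exper]), lwage)

# (iii) Incorrect first stage: educ on sibs only, omitting exper.
Z_bad = np.column_stack([const, sibs])
educ_tilde = Z_bad @ np.linalg.lstsq(Z_bad, educ, rcond=None)[0]
b_wrong, _ = ols(np.column_stack([const, educ_tilde, exper]), lwage)

print("2SLS:           ", b_2sls[1], se_2sls[1])
print("manual 2SLS:    ", b_manual[1], se_manual[1])
print("wrong two-step: ", b_wrong[1])
```

In this design the manual procedure reproduces the 2SLS coefficients but reports a different standard error, because its residuals are built from the fitted values rather than from educ itself. The two-step procedure with the incorrect first stage drifts away from the true return of 0.1 because sibs is correlated with the omitted control.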


Use the data in HTV for this exercise. 

(i) Run a simple OLS regression of log(wage) on educ. Without controlling for other factors, what
is the 95% confidence interval for the return to another year of education?

(ii) The variable ctuit, in thousands of dollars, is the change in college tuition facing students from
age 17 to age 18. Show that educ and ctuit are essentially uncorrelated. What does this say 
about ctuit as a possible IV for educ in a simple regression analysis? 

(iii) Now, add to the simple regression model in part (i) a quadratic in experience and a full set of 
regional dummy variables for current residence and residence at age 18. Also include the urban 
indicators for current and age 18 residences. What is the estimated return to a year of education? 

(iv) Again using ctuit as a potential IV for educ, estimate the reduced form for educ. [Naturally, the 
reduced form for educ now includes the explanatory variables in part (iii).] Show that ctuit is 
now statistically significant in the reduced form for educ. 

(v) Estimate the model from part (iii) by IV, using ctuit as an IV for educ. How does the confidence 
interval for the return to education compare with the OLS CI from part (iii)? 

(vi) Do you think the IV procedure from part (v) is convincing? 


The data set in VOUCHER, which is a subset of the data used in Rouse (1998), can be used to estimate
the effect of school choice on academic achievement. Attendance at a choice school was paid for by a
voucher, which was determined by a lottery among those who applied. The data subset was chosen so
that any student in the sample has a valid 1994 math test score (the last year available in Rouse’s sample).
Unfortunately, as pointed out by Rouse, many students have missing test scores, possibly due to attrition
(that is, leaving the Milwaukee public school district). These data include students who applied to the
voucher program and were accepted, students who applied and were not accepted, and students who did
not apply. Therefore, even though the vouchers were chosen by lottery among those who applied, we do
not necessarily have a random sample from a population where being selected for a voucher has been
randomly determined. (An important consideration is that students who never applied to the program may be
systematically different from those who did—and in ways that we cannot know based on the data.)

Rouse (1998) uses panel data methods of the kind we discussed in Chapter 14 to allow student
fixed effects; she also uses instrumental variables methods. This problem asks you to do a
cross-sectional analysis in which winning the lottery for a voucher acts as an instrumental variable for attending
a choice school. Actually, because we have multiple years of data on each student, we construct two
variables. The first, choiceyrs, is the number of years from 1991 to 1994 that a student attended a choice
school; this variable ranges from zero to four. The variable selectyrs indicates the number of years a
student was selected for a voucher. If the student applied for the program in 1990 and received a voucher
then selectyrs = 4; if he or she applied in 1991 and received a voucher then selectyrs = 3; and so on.
The outcome of interest is mnce, the student’s percentile score on a math test administered in 1994.

(i) Of the 990 students in the sample, how many were never awarded a voucher? How many had a 
voucher available for four years? How many students actually attended a choice school for four 
years? 

(ii) Run a simple regression of choiceyrs on selectyrs. Are these variables related in the direction
you expected? How strong is the relationship? Is selectyrs a sensible IV candidate for choiceyrs?

(iii) Run a simple regression of mnce on choiceyrs. What do you find? Is this what you expected?
What happens if you add the variables black, hispanic, and female?

(iv) Why might choiceyrs be endogenous in an equation such as

$$\mathit{mnce} = \beta_0 + \beta_1 \mathit{choiceyrs} + \beta_2 \mathit{black} + \beta_3 \mathit{hispanic} + \beta_4 \mathit{female} + u_1?$$


(v) Estimate the equation in part (iv) by instrumental variables, using selectyrs as the IV for 
choiceyrs. Does using IV produce a positive effect of attending a choice school? What do you 
make of the coefficients on the other explanatory variables? 

(vi) To control for the possibility that prior achievement affects participating in the lottery (as well 
as predicting attrition), add mnce90—the math score in 1990—to the equation in part (iv). 
Estimate the equation by OLS and IV, and compare the results for $\beta_1$. For the IV estimate, how
much is each year in a choice school worth on the math percentile score? Is this a practically 
large effect? 

(vii) Why is the analysis from part (vi) not entirely convincing? [Hint: Compared with part (v), what 
happens to the number of observations, and why?] 

(viii) The variables choiceyrs1, choiceyrs2, and so on are dummy variables indicating the different 
number of years a student could have been in a choice school (from 1991 to 1994). The dummy 
variables selectyrs1, selectyrs2, and so on have a similar definition, but for being selected from 
the lottery. Estimate the equation 


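$$\mathit{mnce} = \beta_0 + \beta_1 \mathit{choiceyrs1} + \beta_2 \mathit{choiceyrs2} + \beta_3 \mathit{choiceyrs3} + \beta_4 \mathit{choiceyrs4} + \beta_5 \mathit{black} + \beta_6 \mathit{hispanic} + \beta_7 \mathit{female} + \beta_8 \mathit{mnce90} + u_1$$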


by IV, using as instruments the four selectyrs dummy variables. (As before, the variables black, 
hispanic, and female act as their own IVs.) Describe your findings. Do they make sense? 


Use the data in CATHOLIC to answer this question. The model of interest is

$$\mathit{math12} = \beta_0 + \beta_1 \mathit{cathhs} + \beta_2 \mathit{lfaminc} + \beta_3 \mathit{motheduc} + \beta_4 \mathit{fatheduc} + u,$$


where cathhs is a binary indicator for whether a student attends a Catholic high school. 

(i) How many students are in the sample? What percentage of these students attend a Catholic high
school?

(ii) Estimate the above equation by OLS. What is the estimate of $\beta_1$? What is its 95% confidence
interval?

(iii) Using parcath as an instrument for cathhs, estimate the reduced form for cathhs. What is the 
t statistic for parcath? Is there evidence of a weak instrument problem? 

(iv) Estimate the above equation by IV, using parcath as an IV for cathhs. How does the estimate 
and 95% CI compare with the OLS quantities? 

(v) Test the null hypothesis that cathhs is exogenous. What is the p-value of the test? 

(vi) Suppose you add the interaction $\mathit{cathhs} \cdot \mathit{motheduc}$ to the above model.
Why is it generally endogenous? Why is $\mathit{parcath} \cdot \mathit{motheduc}$ a good IV candidate for
$\mathit{cathhs} \cdot \mathit{motheduc}$?

(vii) Before you create the interactions in part (vi), first find the sample average of motheduc and
create $\mathit{cathhs} \cdot (\mathit{motheduc} - \overline{\mathit{motheduc}})$ and $\mathit{parcath} \cdot (\mathit{motheduc} - \overline{\mathit{motheduc}})$. Add the first
interaction to the model and use the second as an IV. Of course, cathhs is also instrumented. Is
the interaction term statistically significant?

(viii) Compare the coefficient on cathhs in (vii) to that in part (iv). Is including the interaction 
important for estimating the average partial effect? 


Use the data in LABSUP to answer the following questions. These are data on almost 32,000 black 
or Hispanic women. Every woman in the sample is married. It is a subset of the data used in Angrist 
and Evans (1998). Our interest here is in determining how weekly hours worked, hours, changes with 
number of children (kids). All women in the sample have at least two children. The two potential 
instrumental variables for kids, which is suspected as being endogenous, work to generate exogenous
variation starting with two children. See the original article for further discussion. 
(i) Estimate the equation 


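$$\mathit{hours} = \beta_0 + \beta_1 \mathit{kids} + \beta_2 \mathit{nonmomi} + \beta_3 \mathit{educ} + \beta_4 \mathit{age} + \beta_5 \mathit{age}^2 + \beta_6 \mathit{black} + \beta_7 \mathit{hispan} + u$$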


by OLS and obtain the heteroskedasticity-robust standard errors. Interpret the coefficient on 
kids. Discuss its statistical significance. 

(ii) A variable that Angrist and Evans propose as an instrument is samesex, a binary variable equal
to one if the first two children are the same biological sex. What do you think is the argument 
for why it is a relevant instrument for kids? 

(iii) Run the regression 


$$\mathit{kids}_i \text{ on } \mathit{samesex}_i,\ \mathit{nonmomi}_i,\ \mathit{educ}_i,\ \mathit{age}_i,\ \mathit{age}_i^2,\ \mathit{black}_i,\ \mathit{hispan}_i$$


and see if the story from part (ii) holds up. In particular, interpret the coefficient on samesex.
How statistically significant is samesex? 

(iv) Can you think of mechanisms by which samesex is correlated with u in the equation in part (i)? (It is
fine to assume that biological sex is randomly determined.) [Hint: How might a family’s finances be 
affected based on whether they have two children of the same sex or two children of opposite sex?] 

(v) Is it legitimate to check for exogeneity of samesex by adding it to the regression in part (i) and
testing its significance? Explain. 

(vi) Using samesex as an IV for kids, obtain the IV estimates of the equation in part (i). How does 
the kids coefficient compare with the OLS estimate? Is the IV estimate precise? 

(vii) Now add multi2nd as an instrument. Obtain the F statistic from the first stage regression and
determine whether samesex and multi2nd are sufficiently strong.

(viii) Using samesex and multi2nd both as instruments for kids, how does the 2SLS estimate compare 
with the OLS and IV estimates from the previous parts? 

(ix) Using the estimation from part (viii), is there strong evidence that kids is endogenous in the 
hours equation? 

(x) In part (viii), how many overidentification restrictions are there? Does the overidentification 
test pass? 


APPENDIX 15A 


15A.1 Assumptions for Two Stage Least Squares 


This appendix covers the assumptions under which 2SLS has desirable large sample properties. We 
first state the assumptions for cross-sectional applications under random sampling. Then, we discuss 
what needs to be added for them to apply to time series and panel data. 


15A.2 Assumption 2SLS.1 (Linear in Parameters) 
The model in the population can be written as 


$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + u,$$


where $\beta_0, \beta_1, \ldots, \beta_k$ are the unknown parameters (constants) of interest and $u$ is an unobserved
random error or random disturbance term. The instrumental variables are denoted as $z_j$.

It is worth emphasizing that Assumption 2SLS.1 is virtually identical to MLR.1 (with the minor
exception that 2SLS.1 mentions the notation for the instrumental variables, $z_j$). In other words, the
model we are interested in is the same as that for OLS estimation of the $\beta_j$. Sometimes it is easy to
lose sight of the fact that we can apply different estimation methods to the same model. Unfortu- 
nately, it is not uncommon to hear researchers say “I estimated an OLS model” or “I used a 2SLS 
model.” Such statements are meaningless. OLS and 2SLS are different estimation methods that are 
applied to the same model. It is true that they have desirable statistical properties under different 
sets of assumptions on the model, but the relationship they are estimating is given by the equation in 
2SLS.1 (or MLR.1). The point is similar to that made for the unobserved effects panel data model 
covered in Chapters 13 and 14: pooled OLS, first differencing, fixed effects, and random effects are 
different estimation methods for the same model. 


15A.3 Assumption 2SLS.2 (Random Sampling) 


We have a random sample on $y$, the $x_j$, and the $z_j$.


15A.4 Assumption 2SLS.3 (Rank Condition) 


(i) There are no perfect linear relationships among the instrumental variables. (ii) The rank condition 
for identification holds. 

With a single endogenous explanatory variable, as in equation (15.42), the rank condition is
easily described. Let $z_1, \ldots, z_m$ denote the exogenous variables, where $z_k, \ldots, z_m$ do not appear in the
structural model (15.42). The reduced form of $y_2$ is


$$y_2 = \pi_0 + \pi_1 z_1 + \pi_2 z_2 + \cdots + \pi_{k-1} z_{k-1} + \pi_k z_k + \cdots + \pi_m z_m + v_2.$$


Then, we need at least one of $\pi_k, \ldots, \pi_m$ to be nonzero. This requires at least one exogenous
variable that does not appear in (15.42) (the order condition). Stating the rank condition with two
or more endogenous explanatory variables requires matrix algebra. [See Wooldridge (2010,
Chapter 5).]


15A.5 Assumption 2SLS.4 (Exogenous Instrumental Variables) 


The error term u has zero mean, and each IV is uncorrelated with u. 
Remember that any $x_j$ that is uncorrelated with $u$ also acts as an IV.


15A.6 Theorem 15A.1 


Under Assumptions 2SLS.1 through 2SLS.4, the 2SLS estimator is consistent. 


15A.7 Assumption 2SLS.5 (Homoskedasticity) 


Let $\mathbf{z}$ denote the collection of all instrumental variables. Then, $E(u^2 \mid \mathbf{z}) = \sigma^2$.


15A.8 Theorem 15A.2 


Under Assumptions 2SLS.1 through 2SLS.5, the 2SLS estimators are asymptotically normally
distributed. Consistent estimators of the asymptotic variance are given as in equation (15.43), where $\sigma^2$
is replaced with $\hat{\sigma}^2 = (n - k - 1)^{-1} \sum_{i=1}^{n} \hat{u}_i^2$, and the $\hat{u}_i$ are the 2SLS residuals.
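
As a numerical companion to Theorem 15A.2, the sketch below computes 2SLS estimates, $\hat{\sigma}^2$ from the 2SLS residuals, and the estimated asymptotic variance matrix $\hat{\sigma}^2(\hat{X}'\hat{X})^{-1}$ on simulated data. The data-generating process and all variable names are assumptions. The final lines compare the matrix formula with the scalar form $\hat{\sigma}^2/[\widehat{SST}_2(1 - \hat{R}_2^2)]$ for the endogenous regressor; treating that scalar form as the content of equation (15.43) is also an assumption, since the equation is not reproduced in this appendix.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000
z1 = rng.normal(size=n)                     # exogenous regressor in the structural equation
z2 = rng.normal(size=n)                     # excluded instrument
v = rng.normal(size=n)
u = 0.6 * v + rng.normal(size=n)            # u correlated with v, so y2 is endogenous
y2 = 0.5 * z1 + 1.0 * z2 + v
y1 = 1.0 + 2.0 * y2 - 1.0 * z1 + u

const = np.ones(n)
X = np.column_stack([const, y2, z1])        # structural regressors
Z = np.column_stack([const, z1, z2])        # all exogenous variables

# First-stage fitted values and 2SLS coefficients.
Xhat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]
beta = np.linalg.lstsq(Xhat, y1, rcond=None)[0]

# sigma^2 hat uses the 2SLS residuals, built from the ORIGINAL regressors.
k = X.shape[1] - 1
uhat = y1 - X @ beta
sigma2_hat = uhat @ uhat / (n - k - 1)
avar = sigma2_hat * np.linalg.inv(Xhat.T @ Xhat)
se = np.sqrt(np.diag(avar))

# Scalar form for the coefficient on y2: sigma2_hat / [SST_2hat * (1 - R2_2hat)].
y2hat = Xhat[:, 1]
sst2 = np.sum((y2hat - y2hat.mean()) ** 2)
W = np.column_stack([const, z1])
fit = W @ np.linalg.lstsq(W, y2hat, rcond=None)[0]
r2 = 1.0 - np.sum((y2hat - fit) ** 2) / sst2
var_scalar = sigma2_hat / (sst2 * (1.0 - r2))

print("2SLS estimates: ", beta)
print("standard errors:", se)
print("matrix vs scalar variance for y2:", avar[1, 1], var_scalar)
```

The residuals are computed with the observed $y_2$, not its fitted values; using the fitted values instead is the mistake that makes manually computed second-stage standard errors inappropriate.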

The 2SLS estimator is also the best IV estimator under the five assumptions given. We state the 
result here. A proof can be found in Wooldridge (2010, Chapter 5). 

15A.9 Theorem 15A.3


Under Assumptions 2SLS.1 through 2SLS.5, the 2SLS estimator is asymptotically efficient in the 
class of IV estimators that uses linear combinations of the exogenous variables as instruments. 

If the homoskedasticity assumption does not hold, the 2SLS estimators are still asymptotically
normal, but the standard errors (and $t$ and $F$ statistics) need to be adjusted; many econometrics
packages do this routinely. Moreover, the 2SLS estimator is no longer the asymptotically efficient IV
estimator, in general. We will not study more efficient estimators here [see Wooldridge (2010, Chapter 8)].
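
One common way to adjust the standard errors is a sandwich-type variance estimator. The sketch below uses the simplest version, with no degrees-of-freedom correction; individual packages may apply one, so their output need not match exactly. The simulated design, in which $\text{Var}(u \mid z)$ grows with $z^2$, and all variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20_000
z = rng.normal(size=n)                          # instrument
v = rng.normal(size=n)
u = z * rng.normal(size=n) + 0.5 * v            # Var(u|z) = z^2 + 0.25 (heteroskedastic)
x = z + v                                       # endogenous regressor
y = 1.0 + 2.0 * x + u

const = np.ones(n)
X = np.column_stack([const, x])
Z = np.column_stack([const, z])

# 2SLS point estimates and residuals.
Xhat = Z @ np.linalg.lstsq(Z, X, rcond=None)[0]
beta = np.linalg.lstsq(Xhat, y, rcond=None)[0]
uhat = y - X @ beta

bread = np.linalg.inv(Xhat.T @ Xhat)

# Usual variance: appropriate only under Assumption 2SLS.5.
sigma2_hat = uhat @ uhat / (n - X.shape[1])
se_usual = np.sqrt(np.diag(sigma2_hat * bread))

# Robust sandwich: bread * (sum_i uhat_i^2 * xhat_i xhat_i') * bread.
meat = (Xhat * uhat[:, None] ** 2).T @ Xhat
se_robust = np.sqrt(np.diag(bread @ meat @ bread))

print("2SLS slope:", beta[1])
print("usual se:  ", se_usual[1])
print("robust se: ", se_robust[1])
```

In this design the usual standard error for the slope is noticeably too small, because the error variance is largest exactly where the instrument takes its most informative values.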

For time series applications, we must add some assumptions. First, as with OLS, we must 
assume that all series (including the IVs) are weakly dependent: this ensures that the law of large 
numbers and the central limit theorem hold. For the usual standard errors and test statistics to be 
valid, as well as for asymptotic efficiency, we must add a no serial correlation assumption. 


15A.10 Assumption 2SLS.6 (No Serial Correlation) 


Equation (15.54) holds. 
A similar no serial correlation assumption is needed in panel data applications. Tests and correc- 
tions for serial correlation were discussed in Section 15-7. 
In the previous chapter, we showed how the method of instrumental variables can solve two kinds
of endogeneity problems: omitted variables and measurement error. Conceptually, these problems
are straightforward. In the omitted variables case, there is a variable (or more than one) that we
would like to hold fixed when estimating the ceteris paribus effect of one or more of the observed
explanatory variables. In the measurement error case, we would like to estimate the effect of certain
explanatory variables on y, but we have mismeasured one or more variables. In both cases, we could
estimate the parameters of interest by OLS if we could collect better data.

Another important form of endogeneity of explanatory variables is simultaneity. This arises 
when one or more of the explanatory variables is jointly determined with the dependent variable, 
typically through an equilibrium mechanism (as we will see later). In this chapter, we study methods 
for estimating simple simultaneous equations models (SEMs). Although a complete treatment of 
SEMs is beyond the scope of this text, we are able to cover models that are widely used. 

The leading method for estimating simultaneous equations models is the method of instrumental 
variables. Therefore, the solution to the simultaneity problem is essentially the same as the IV 
solutions to the omitted variables and measurement error problems. However, crafting and interpreting 
SEMs is challenging. Therefore, we begin by discussing the nature and scope of simultaneous equa- 
tions models in Section 16-1. In Section 16-2, we confirm that OLS applied to an equation in a 
simultaneous system is generally biased and inconsistent. 

Section 16-3 provides a general description of identification and estimation in a two-equation
system, while Section 16-4 briefly covers models with more than two equations. Simultaneous
equations models are used to model aggregate time series, and in Section 16-5 we include a
discussion of some special issues that arise in such models. Section 16-6 touches on simultaneous equations
models with panel data.


16-1 The Nature of Simultaneous Equations Models 


The most important point to remember in using simultaneous equations models is that each equation 
in the system should have a ceteris paribus, causal interpretation. Because we only observe the out- 
comes in equilibrium, we are required to use counterfactual reasoning in constructing the equations 
of a simultaneous equations model. We must think in terms of potential as well as actual outcomes. 

When there are only two states of the world—a worker does or does not participate in a job train- 
ing program, say—we formally described the potential outcomes setting in Sections 2-7, 3-7, 7-6, and 
elsewhere. The framework for simultaneous equations models is more complicated because we must 
represent a continuum of alternative realities. For example, the demand for a product, say, milk, is a 
function of the price of milk (and other variables). A demand function for milk determines how much 
milk someone would purchase at each possible price. Rather than formally introduce a notation for a 
continuum of potential outcomes, for our purposes it suffices to be less formal and to illustrate coun- 
terfactual thinking through examples. 

The classic example of an SEM is a supply and demand equation for some commodity or input to
production (such as labor). For concreteness, let $h_s$ denote the annual labor hours supplied by workers
in agriculture, measured at the county level, and let $w$ denote the average hourly wage offered to such
workers. A simple labor supply function is


$$h_s = \alpha_1 w + \beta_1 z_1 + u_1, \tag{16.1}$$


where $z_1$ is some observed variable affecting labor supply—say, the average manufacturing wage in
the county. The error term, $u_1$, contains other factors that affect labor supply. [Many of these factors
are observed and could be included in equation (16.1); to illustrate the basic concepts, we include
only one such factor, $z_1$.] Equation (16.1) is an example of a structural equation. This name comes
from the fact that the labor supply function is derivable from economic theory and has a causal
interpretation. The coefficient $\alpha_1$ measures how labor supply changes when the wage changes; if $h_s$
and $w$ are in logarithmic form, $\alpha_1$ is the labor supply elasticity. Typically, we expect $\alpha_1$ to be
positive (although economic theory does not rule out $\alpha_1 \le 0$). Labor supply elasticities are important
for determining how workers will change the number of hours they desire to work when tax rates on
wage income change. If $z_1$ is the manufacturing wage, we expect $\beta_1 \le 0$: other factors equal, if the
manufacturing wage increases, more workers will go into manufacturing than into agriculture.

When we graph labor supply, we sketch hours as a function of wage, with $z_1$ and $u_1$ held fixed.
A change in $z_1$ shifts the labor supply function, as does a change in $u_1$. The difference is that $z_1$ is
observed while $u_1$ is not. Sometimes, $z_1$ is called an observed supply shifter, and $u_1$ is called an
unobserved supply shifter.

How does equation (16.1) differ from those we have studied previously? The difference is subtle. 
Although equation (16.1) is supposed to hold for all possible values of wage, we cannot generally 
view wage as varying exogenously for a cross section of counties. If we could run an experiment 
where we vary the level of agricultural and manufacturing wages across a sample of counties and 
survey workers to obtain the labor supply $h_s$ for each county, then we could estimate (16.1) by OLS.
Unfortunately, this is not a manageable experiment. Instead, we must collect data on average wages in 
these two sectors along with how many person hours were spent in agricultural production. In decid- 
ing how to analyze these data, we must understand that they are best described by the interaction of 
labor supply and demand. Under the assumption that labor markets clear, we actually observe equilib- 
rium values of wages and hours worked. 
To describe how equilibrium wages and hours are determined, we need to bring in the demand
for labor, which we suppose is given by


$$h_d = \alpha_2 w + \beta_2 z_2 + u_2, \tag{16.2}$$


where $h_d$ is hours demanded. As with the supply function, we graph hours demanded as a function
of wage, $w$, keeping $z_2$ and $u_2$ fixed. The variable $z_2$—say, agricultural land area—is an observable
demand shifter, while $u_2$ is an unobservable demand shifter.

Just as with the labor supply equation, the labor demand equation is a structural equation: it can
be obtained from the profit maximization considerations of farmers. If $h_d$ and $w$ are in logarithmic
form, $\alpha_2$ is the labor demand elasticity. Economic theory tells us that $\alpha_2 < 0$. Because labor and land
are complements in production, we expect $\beta_2 > 0$.

Notice how equations (16.1) and (16.2) describe entirely different relationships. Labor supply 
is a behavioral equation for workers, and labor demand is a behavioral relationship for farmers. 
Each equation has a ceteris paribus interpretation and stands on its own. They become linked in an 
econometric analysis only because observed wage and hours are determined by the intersection of 
supply and demand. In other words, for each county $i$, observed hours $h_i$ and observed wage $w_i$ are
determined by the equilibrium condition


$$h_{is} = h_{id}. \tag{16.3}$$


Because we observe only equilibrium hours for each county $i$, we denote observed hours by $h_i$.
When we combine the equilibrium condition in (16.3) with the labor supply and demand
equations, we get


$$h_i = \alpha_1 w_i + \beta_1 z_{i1} + u_{i1} \tag{16.4}$$

and

$$h_i = \alpha_2 w_i + \beta_2 z_{i2} + u_{i2}, \tag{16.5}$$


where we explicitly include the $i$ subscript to emphasize that $h_i$ and $w_i$ are the equilibrium observed
values for county $i$. These two equations constitute a simultaneous equations model (SEM), which
has several important features. First, given $z_{i1}$, $z_{i2}$, $u_{i1}$, and $u_{i2}$, these two equations determine $h_i$ and
$w_i$. (Actually, we must assume that $\alpha_1 \neq \alpha_2$, which means that the slopes of the supply and demand
functions differ; see Problem 1.) For this reason, $h_i$ and $w_i$ are the endogenous variables in this
SEM. What about $z_{i1}$ and $z_{i2}$? Because they are determined outside of the model, we view them as
exogenous variables. From a statistical standpoint, the key assumption concerning $z_{i1}$ and $z_{i2}$ is that
they are both uncorrelated with the supply and demand errors, $u_{i1}$ and $u_{i2}$, respectively. These are
examples of structural errors because they appear in the structural equations.

A second important point is that, without including $z_1$ and $z_2$ in the model, there is no way to tell
which equation is the supply function and which is the demand function. When $z_1$ represents
manufacturing wage, economic reasoning tells us that it is a factor in agricultural labor supply because it is
a measure of the opportunity cost of working in agriculture; when $z_2$ stands for agricultural land area,
production theory implies that it appears in the labor demand function. Therefore, we know that (16.4)
represents labor supply and (16.5) represents labor demand. If $z_1$ and $z_2$ are the same—for example,
average education level of adults in the county, which can affect both supply and demand—then the
equations look identical, and there is no hope of estimating either one. In a nutshell, this illustrates
the identification problem in simultaneous equations models, which we will discuss more generally in
Section 16-3.
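
The role of the two shifters can be illustrated with a small simulation of equations (16.4) and (16.5). This is only a sketch under assumed parameter values ($\alpha_1 = 0.5$, $\beta_1 = -1$, $\alpha_2 = -1$, $\beta_2 = 1$) with independent standard normal shifters and errors; the helper `iv_slope` is a hypothetical name introduced here, and the IV estimator it uses is the one from the previous chapter.

```python
import numpy as np

def iv_slope(y, x, exog, instr):
    """IV estimate of b in y = a + b*x + c*exog + error,
    using instr as the instrument for x (just-identified case)."""
    n = len(y)
    X = np.column_stack([np.ones(n), x, exog])
    Z = np.column_stack([np.ones(n), instr, exog])
    return np.linalg.solve(Z.T @ X, Z.T @ y)[1]

rng = np.random.default_rng(3)
n = 50_000
alpha1, beta1 = 0.5, -1.0      # supply: h = alpha1*w + beta1*z1 + u1
alpha2, beta2 = -1.0, 1.0      # demand: h = alpha2*w + beta2*z2 + u2
z1, z2, u1, u2 = rng.normal(size=(4, n))

# Equilibrium: set supply equal to demand and solve for the wage.
w = (beta2 * z2 + u2 - beta1 * z1 - u1) / (alpha1 - alpha2)
h = alpha1 * w + beta1 * z1 + u1

# OLS of h on w and z1 does not recover the supply slope.
X = np.column_stack([np.ones(n), w, z1])
ols_supply = np.linalg.lstsq(X, h, rcond=None)[0][1]

# The demand shifter z2 is excluded from supply, so it can instrument w there;
# symmetrically, the supply shifter z1 identifies the demand equation.
iv_supply = iv_slope(h, w, z1, z2)
iv_demand = iv_slope(h, w, z2, z1)

print("true supply slope:", alpha1, " OLS:", ols_supply, " IV:", iv_supply)
print("true demand slope:", alpha2, " IV:", iv_demand)
```

If the same variable were used as both shifters, neither equation would have an excluded variable available as an instrument, which is the identification failure described above.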

The most convincing examples of SEMs have the same flavor as supply and demand examples. 
Each equation should have a behavioral, ceteris paribus interpretation on its own. Because we only 
observe equilibrium outcomes, specifying an SEM requires us to ask such counterfactual questions 
as: How much labor would workers provide if the wage were different from its equilibrium value? 
Example 16.1 provides another illustration of an SEM in which each equation has a ceteris paribus 
interpretation. 
Example 16.1 Murder Rates and Size of the Police Force


Cities often want to determine how much additional law enforcement will decrease their murder rates.
A simple cross-sectional model to address this question is


$$\mathit{murdpc} = \alpha_1 \mathit{polpc} + \beta_{10} + \beta_{11} \mathit{incpc} + u_1, \tag{16.6}$$


where murdpc is murders per capita, polpc is number of police officers per capita, and incpc is income 
per capita. (Henceforth, we do not include an i subscript.) We take income per capita as exogenous in 
this equation. In practice, we would include other factors, such as age and gender distributions, educa- 
tion levels, perhaps geographic variables, and variables that measure severity of punishment. To fix 
ideas, we consider equation (16.6). 

The question we hope to answer is: If a city exogenously increases its police force, will that 
increase, on average, lower the murder rate? If we could exogenously choose police force sizes for a 
random sample of cities, we could estimate (16.6) by OLS. Certainly, we cannot run such an experi- 
ment. But can we think of police force size as being exogenously determined, anyway? Probably not. 
A city’s spending on law enforcement is at least partly determined by its expected murder rate. To 
reflect this, we postulate a second relationship: 


$$\mathit{polpc} = \alpha_2 \mathit{murdpc} + \beta_{20} + \text{other factors}. \tag{16.7}$$


We expect that $\alpha_2 > 0$: other factors being equal, cities with higher (expected) murder rates will have
more police officers per capita. Once we specify the other factors in (16.7), we have a two-equation 
simultaneous equations model. We are really only interested in equation (16.6), but, as we will see in 
Section 16-3, we need to know precisely how the second equation is specified in order to estimate the first. 

An important point is that (16.7) describes behavior by city officials, while (16.6) describes the 
actions of potential murderers. This gives each equation a clear ceteris paribus interpretation, which 
makes equations (16.6) and (16.7) an appropriate simultaneous equations model. 


We next give an example of an inappropriate use of SEMs. 


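Example 16.2 Housing Expenditures and Saving


Suppose that, for a random household in the population, we assume that annual housing expenditures
and saving are jointly determined by


$$\mathit{housing} = \alpha_1 \mathit{saving} + \beta_{10} + \beta_{11} \mathit{inc} + \beta_{12} \mathit{educ} + \beta_{13} \mathit{age} + u_1 \tag{16.8}$$

and

$$\mathit{saving} = \alpha_2 \mathit{housing} + \beta_{20} + \beta_{21} \mathit{inc} + \beta_{22} \mathit{educ} + \beta_{23} \mathit{age} + u_2, \tag{16.9}$$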


where inc is annual income and educ and age are measured in years. Initially, it may seem that these 
equations are a sensible way to view how housing and saving expenditures are determined. But we have 
to ask: What value would one of these equations be without the other? Neither has a ceteris paribus inter- 
pretation because housing and saving are chosen by the same household. For example, it makes no sense 
to ask this question: If annual income increases by $10,000, how would housing expenditures change, 
holding saving fixed? If family income increases, a household will generally change the optimal mix of 
housing expenditures and saving. But equation (16.8) makes it seem as if we want to know the effect of 
changing inc, educ, or age while keeping saving fixed. Such a thought experiment is not interesting. Any 
model based on economic principles, particularly utility maximization, would have households opti- 
mally choosing housing and saving as functions of inc and the relative prices of housing and saving. The 
variables educ and age would affect preferences for consumption, saving, and risk. Therefore, housing 
and saving would each be functions of income, education, age, and other variables that affect the utility 
maximization problem (such as different rates of return on housing and other saving). 
Even if we decided that the SEM in (16.8) and (16.9) made sense, there is no way to estimate the
parameters. (We discuss this problem more generally in Section 16-3.) The two equations are indistin- 
guishable, unless we assume that income, education, or age appears in one equation but not the other, 
which would make no sense. 

Though this makes a poor SEM example, we might be interested in testing whether, other factors 
being fixed, there is a tradeoff between housing expenditures and saving. But then we would just esti- 
mate, say, (16.8) by OLS, unless there is an omitted variable or measurement error problem. 


Example 16.2 has the characteristics of all too many SEM applications. The problem is that the
two endogenous variables are chosen by the same economic agent. Therefore, neither equation can
stand on its own. Another example of an inappropriate use of an SEM would be to model weekly hours
spent studying and weekly hours working. Each student will choose these variables simultaneously—
presumably as a function of the wage that can be earned working, ability as a student, enthusiasm for
college, and so on. Just as in Example 16.2, it makes no sense to specify two equations where each is
a function of the other. The important lesson is this: just because two variables are determined
simultaneously does not mean that a simultaneous equations model is suitable. For an SEM to make sense,
each equation in the SEM should have a ceteris paribus interpretation in isolation from the other equation.
As we discussed earlier, supply and demand examples, and Example 16.1, have this feature. Usually, basic
economic reasoning, supported in some cases by simple economic models, can help us use SEMs
intelligently (including knowing when not to use an SEM).

A standard model of advertising for monopolistic firms has firms choosing profit
maximizing levels of price and advertising expenditures. Does this mean we should
use an SEM to model these variables at the firm level?


16-2 Simultaneity Bias in OLS 


It is useful to see, in a simple model, that an explanatory variable that is determined simultaneously 
with the dependent variable is generally correlated with the error term, which leads to bias and incon- 
sistency in OLS. We consider the two-equation structural model 


Yı 5&2 + zı + u [16.10] 


Yo = Any, + Boz + Uy [16.11] 


and focus on estimating the first equation. The variables z, and z, are exogenous, so that each is 
uncorrelated with u; and vu. For simplicity, we suppress the intercept in each equation. 

To show that $y_2$ is generally correlated with $u_1$, we solve the two equations for $y_2$ in terms of the
exogenous variables and the error term. If we plug the right-hand side of (16.10) in for $y_1$ in (16.11),
we get


$y_2 = \alpha_2(\alpha_1 y_2 + \beta_1 z_1 + u_1) + \beta_2 z_2 + u_2$
or
$(1 - \alpha_2\alpha_1)y_2 = \alpha_2\beta_1 z_1 + \beta_2 z_2 + \alpha_2 u_1 + u_2.$ [16.12]
Now, we must make an assumption about the parameters in order to solve for $y_2$:
$\alpha_2\alpha_1 \neq 1.$ [16.13]


Whether this assumption is restrictive depends on the application. In Example 16.1, we think that
$\alpha_1 \le 0$ and $\alpha_2 \ge 0$, which implies $\alpha_1\alpha_2 \le 0$; therefore, (16.13) is very reasonable for Example 16.1.

Provided condition (16.13) holds, we can divide (16.12) by $(1 - \alpha_2\alpha_1)$ and write $y_2$ as
$y_2 = \pi_{21} z_1 + \pi_{22} z_2 + v_2,$ [16.14]


where $\pi_{21} = \alpha_2\beta_1/(1 - \alpha_2\alpha_1)$, $\pi_{22} = \beta_2/(1 - \alpha_2\alpha_1)$, and $v_2 = (\alpha_2 u_1 + u_2)/(1 - \alpha_2\alpha_1)$. Equation
(16.14), which expresses $y_2$ in terms of the exogenous variables and the error terms, is the reduced form
equation for $y_2$, a concept we introduced in Chapter 15 in the context of instrumental variables
estimation. The parameters $\pi_{21}$ and $\pi_{22}$ are called reduced form parameters; notice how they are nonlinear
functions of the structural parameters, which appear in the structural equations, (16.10) and (16.11).

The reduced form error, $v_2$, is a linear function of the structural error terms, $u_1$ and $u_2$. Because
$u_1$ and $u_2$ are each uncorrelated with $z_1$ and $z_2$, $v_2$ is also uncorrelated with $z_1$ and $z_2$. Therefore, we can
consistently estimate $\pi_{21}$ and $\pi_{22}$ by OLS, something that is used for two stage least squares
estimation (which we return to in the next section). In addition, the reduced form parameters are sometimes
of direct interest, although we are focusing here on estimating equation (16.10).
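As a rough numerical check of (16.14), the sketch below simulates the two-equation system and regresses $y_2$ on $z_1$ and $z_2$ by OLS. The parameter values, the standard normal draws, and the helper names (`reduced_form_params`, `simulate_y2`, `ols_no_intercept`) are illustrative assumptions, not part of the text.

```python
import random

def reduced_form_params(a1, a2, b1, b2):
    """Return (pi21, pi22) from (16.14); requires alpha2*alpha1 != 1, as in (16.13)."""
    d = 1.0 - a2 * a1
    if d == 0.0:
        raise ValueError("alpha2*alpha1 must differ from 1")
    return a2 * b1 / d, b2 / d

def simulate_y2(n, a1, a2, b1, b2, seed=0):
    """Draw z1, z2, u1, u2 as independent N(0,1) (an assumption) and build y2."""
    rng = random.Random(seed)
    d = 1.0 - a2 * a1
    z1s, z2s, y2s = [], [], []
    for _ in range(n):
        z1, z2, u1, u2 = (rng.gauss(0.0, 1.0) for _ in range(4))
        # Reduced form for y2, obtained by dividing (16.12) by (1 - a2*a1).
        y2 = (a2 * b1 * z1 + b2 * z2 + a2 * u1 + u2) / d
        z1s.append(z1)
        z2s.append(z2)
        y2s.append(y2)
    return z1s, z2s, y2s

def ols_no_intercept(y, x1, x2):
    """OLS of y on x1 and x2 without an intercept, via the 2x2 normal equations."""
    s11 = sum(a * a for a in x1)
    s22 = sum(b * b for b in x2)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s1y = sum(a * c for a, c in zip(x1, y))
    s2y = sum(b * c for b, c in zip(x2, y))
    det = s11 * s22 - s12 * s12
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det

# Hypothetical structural parameters chosen only for illustration.
A1, A2, B1, B2 = -0.5, 0.8, 1.0, 1.0
pi21, pi22 = reduced_form_params(A1, A2, B1, B2)
z1, z2, y2 = simulate_y2(20000, A1, A2, B1, B2)
pi21_hat, pi22_hat = ols_no_intercept(y2, z1, z2)
print("implied  pi21, pi22:", round(pi21, 3), round(pi22, 3))
print("OLS est. pi21, pi22:", round(pi21_hat, 3), round(pi22_hat, 3))
```

In a large simulated sample the OLS estimates should sit close to the implied reduced form parameters, which is the consistency result described above.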

A reduced form also exists for $y_1$ under assumption (16.13); the algebra is similar to that used to
obtain (16.14). It has the same properties as the reduced form equation for $y_2$.

We can use equation (16.14) to show that, except under special assumptions, OLS estimation of
equation (16.10) will produce biased and inconsistent estimators of $\alpha_1$ and $\beta_1$ in equation (16.10).
Because $z_1$ and $u_1$ are uncorrelated by assumption, the issue is whether $y_2$ and $u_1$ are uncorrelated.
From the reduced form in (16.14), we see that $y_2$ and $u_1$ are correlated if and only if $v_2$ and $u_1$ are
correlated (because $z_1$ and $z_2$ are assumed exogenous). But $v_2$ is a linear function of $u_1$ and $u_2$, so it is
generally correlated with $u_1$. In fact, if we assume that $u_1$ and $u_2$ are uncorrelated, then $v_2$ and $u_1$ must
be correlated whenever $\alpha_2 \neq 0$. Even if $\alpha_2$ equals zero (which means that $y_1$ does not appear in
equation (16.11)), $v_2$ and $u_1$ will be correlated if $u_1$ and $u_2$ are correlated.

When $\alpha_2 = 0$ and $u_1$ and $u_2$ are uncorrelated, $y_2$ and $u_1$ are also uncorrelated. These are fairly
strong requirements: if $\alpha_2 = 0$, $y_2$ is not simultaneously determined with $y_1$. If we add zero
correlation between $u_1$ and $u_2$, this rules out omitted variables or measurement errors in $u_1$ that are correlated
with $y_2$. We should not be surprised that OLS estimation of equation (16.10) works in this case.

When $y_2$ is correlated with $u_1$ because of simultaneity, we say that OLS suffers from simultaneity
bias. Obtaining the direction of the bias in the coefficients is generally complicated, as we saw with
omitted variables bias in Chapters 3 and 5. But in simple models, we can determine the direction of
the bias. For example, suppose that we simplify equation (16.10) by dropping $z_1$ from the equation,
and we assume that $u_1$ and $u_2$ are uncorrelated. Then, the covariance between $y_2$ and $u_1$ is


$\mathrm{Cov}(y_2, u_1) = \mathrm{Cov}(v_2, u_1) = [\alpha_2/(1 - \alpha_2\alpha_1)]E(u_1^2) = [\alpha_2/(1 - \alpha_2\alpha_1)]\sigma_1^2,$


where $\sigma_1^2 = \mathrm{Var}(u_1) > 0$. Therefore, the asymptotic bias (or inconsistency) in the OLS estimator
of $\alpha_1$ has the same sign as $\alpha_2/(1 - \alpha_2\alpha_1)$. If $\alpha_2 > 0$ and $\alpha_2\alpha_1 < 1$, the asymptotic bias is positive.
(Unfortunately, just as in our calculation of omitted variables bias from Section 3-3, the conclusions
do not carry over to more general models. But they do serve as a useful guide.) For example, in
Example 16.1, we think $\alpha_2 > 0$ and $\alpha_2\alpha_1 \le 0$, which means that the OLS estimator of $\alpha_1$ would have
a positive bias. If $\alpha_1 = 0$, OLS would, on average, estimate a positive impact of more police on the
murder rate; generally, the estimator of $\alpha_1$ is biased upward. Because we expect an increase in the
size of the police force to reduce murder rates (ceteris paribus), the upward bias means that OLS will
underestimate the effectiveness of a larger police force.
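The sign result can be illustrated with a small simulation of the simplified model (equation (16.10) without $z_1$). This is only a sketch: the parameter values, the unit variances, and the function names are assumptions made for the illustration. With unit variances, the probability limit of the OLS slope exceeds $\alpha_1$ by $\mathrm{Cov}(y_2, u_1)/\mathrm{Var}(y_2)$.

```python
import random

def simulate(n, a1, a2, b2, seed=1):
    """Simulate y1 = a1*y2 + u1 and y2 = a2*y1 + b2*z2 + u2 with N(0,1) draws."""
    rng = random.Random(seed)
    d = 1.0 - a2 * a1          # must be nonzero, as in (16.13)
    y1s, y2s = [], []
    for _ in range(n):
        z2, u1, u2 = (rng.gauss(0.0, 1.0) for _ in range(3))
        y2 = (b2 * z2 + a2 * u1 + u2) / d   # reduced form for y2
        y1 = a1 * y2 + u1                   # structural equation without z1
        y1s.append(y1)
        y2s.append(y2)
    return y1s, y2s

def ols_slope_no_intercept(y, x):
    """OLS slope of y on x with the intercept suppressed."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)

def asymptotic_bias(a1, a2, b2):
    """Cov(y2,u1)/Var(y2) when z2, u1, u2 are uncorrelated with unit variance."""
    d = 1.0 - a2 * a1
    cov_y2_u1 = a2 / d                              # [a2/(1 - a2*a1)] * sigma_1^2
    var_y2 = (b2 ** 2 + a2 ** 2 + 1.0) / d ** 2
    return cov_y2_u1 / var_y2

# Hypothetical values with alpha2 > 0 and alpha2*alpha1 < 1.
A1, A2, B2 = -0.5, 0.8, 1.0
y1, y2 = simulate(50000, A1, A2, B2)
alpha1_ols = ols_slope_no_intercept(y1, y2)
bias_theory = asymptotic_bias(A1, A2, B2)
print("true alpha1:", A1)
print("OLS estimate:", round(alpha1_ols, 3))
print("alpha1 + asymptotic bias:", round(A1 + bias_theory, 3))
```

Because $\alpha_2 > 0$ and $\alpha_2\alpha_1 < 1$ in this setup, the OLS estimate lands above the true $\alpha_1$, in line with the positive asymptotic bias.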


16-3 Identifying and Estimating a Structural Equation 


As we saw in the previous section, OLS is biased and inconsistent when applied to a structural
equation in a simultaneous equations system. In Chapter 15, we learned that the method of two stage least
squares can be used to solve the problem of endogenous explanatory variables. We now show how
2SLS can be applied to SEMs.

The mechanics of 2SLS are similar to those in Chapter 15. The difference is that, because we
specify a structural equation for each endogenous variable, we can immediately see whether sufficient
IVs are available to estimate either equation. We begin by discussing the identification problem.


16-3a Identification in a Two-Equation System 


We mentioned the notion of identification in Chapter 15. When we estimate a model by OLS, the 
key identification condition is that each explanatory variable is uncorrelated with the error term. As 
we demonstrated in Section 16-2, this fundamental condition no longer holds, in general, for SEMs. 
However, if we have some instrumental variables, we can still identify (or consistently estimate) the 
parameters in an SEM equation, just as with omitted variables or measurement error. 

Before we consider a general two-equation SEM, it is useful to gain intuition by considering a
simple supply and demand example. Write the system in equilibrium form (that is, with $q_s = q_d = q$
imposed) as


$q = \alpha_1 p + \beta_1 z_1 + u_1$ [16.15]
and
$q = \alpha_2 p + u_2.$ [16.16]


For concreteness, let $q$ be per capita milk consumption at the county level, let $p$ be the average price
per gallon of milk in the county, and let $z_1$ be the price of cattle feed, which we assume is exogenous
to the supply and demand equations for milk. This means that (16.15) must be the supply function, as
the price of cattle feed would shift supply ($\beta_1 < 0$) but not demand. The demand function contains
no observed demand shifters.

Given a random sample on $(q, p, z_1)$, which of these equations can be estimated? That is, which
is an identified equation? It turns out that the demand equation, (16.16), is identified, but the supply
equation is not. This is easy to see by using our rules for IV estimation from Chapter 15: we can use
$z_1$ as an IV for price in equation (16.16). However, because $z_1$ appears in equation (16.15), we have no
IV for price in the supply equation.


FIGURE 16.1 Shifting supply equations trace out the demand equation. Each supply
equation is drawn for a different value of the exogenous variable, $z_1$.

Intuitively, the fact that the demand equation is identified follows because we have an observed
variable, $z_1$, that shifts the supply equation while not affecting the demand equation. Given variation
in $z_1$ and no errors, we can trace out the demand curve, as shown in Figure 16.1. The presence of the
unobserved demand shifter $u_2$ causes us to estimate the demand equation with error, but the estimators
will be consistent, provided $z_1$ is uncorrelated with $u_2$.

The supply equation cannot be traced out because there are no exogenous observed factors shift- 
ing the demand curve. It does not help that there are unobserved factors shifting the demand function; 
we need something observed. If, as in the labor demand function (16.2), we have an observed exog- 
enous demand shifter—such as income in the milk demand function—then the supply function would 
also be identified. 

To summarize: In the system of (16.15) and (16.16), it is the presence of an exogenous variable in 
the supply equation that allows us to estimate the demand equation. 
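The identification argument for (16.15) and (16.16) can be mimicked with simulated data. The sketch below is an illustration under assumed parameter values and standard normal shocks; the function names are hypothetical. It uses the supply shifter $z_1$ as an instrument for price in the demand equation and compares the result with OLS.

```python
import random

def cov(x, y):
    """Sample covariance of two equal-length lists."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)

def simulate_market(n, a1, b1, a2, seed=2):
    """Supply: q = a1*p + b1*z1 + u1.  Demand: q = a2*p + u2.  Returns equilibrium data."""
    rng = random.Random(seed)
    qs, ps, zs = [], [], []
    for _ in range(n):
        z1, u1, u2 = (rng.gauss(0.0, 1.0) for _ in range(3))
        # Equate supply and demand and solve for the equilibrium price.
        p = (u2 - u1 - b1 * z1) / (a1 - a2)
        q = a2 * p + u2
        qs.append(q)
        ps.append(p)
        zs.append(z1)
    return qs, ps, zs

# Hypothetical values: upward-sloping supply, downward-sloping demand, beta1 < 0.
A1, B1, A2 = 1.0, -1.0, -1.0
q, p, z1 = simulate_market(20000, A1, B1, A2)

demand_slope_iv = cov(z1, q) / cov(z1, p)    # IV estimator with z1 as instrument for p
demand_slope_ols = cov(p, q) / cov(p, p)     # OLS slope of q on p
print("true demand slope:", A2)
print("IV estimate:", round(demand_slope_iv, 3))
print("OLS estimate:", round(demand_slope_ols, 3))
```

The IV estimate should be close to the demand slope, while OLS mixes supply and demand responses. No analogous instrument exists for the supply equation here, because the only observed exogenous variable already appears in it.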

Extending the identification discussion to a general two-equation model is not difficult. Write the 
two equations as 


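$y_1 = \beta_{10} + \alpha_1 y_2 + \mathbf{z}_1\boldsymbol{\beta}_1 + u_1$ [16.17]
and


$y_2 = \beta_{20} + \alpha_2 y_1 + \mathbf{z}_2\boldsymbol{\beta}_2 + u_2,$ [16.18]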


where $y_1$ and $y_2$ are the endogenous variables and $u_1$ and $u_2$ are the structural error terms. The intercept
in the first equation is $\beta_{10}$, and the intercept in the second equation is $\beta_{20}$. The variable $\mathbf{z}_1$ denotes a set
of $k_1$ exogenous variables appearing in the first equation: $\mathbf{z}_1 = (z_{11}, z_{12}, \ldots, z_{1k_1})$. Similarly, $\mathbf{z}_2$ is the
set of $k_2$ exogenous variables in the second equation: $\mathbf{z}_2 = (z_{21}, z_{22}, \ldots, z_{2k_2})$. In many cases, $\mathbf{z}_1$ and $\mathbf{z}_2$
will overlap. As a shorthand form, we use the notation


$\mathbf{z}_1\boldsymbol{\beta}_1 = \beta_{11}z_{11} + \beta_{12}z_{12} + \cdots + \beta_{1k_1}z_{1k_1}$


and


$\mathbf{z}_2\boldsymbol{\beta}_2 = \beta_{21}z_{21} + \beta_{22}z_{22} + \cdots + \beta_{2k_2}z_{2k_2};$


that is, $\mathbf{z}_1\boldsymbol{\beta}_1$ stands for all exogenous variables in the first equation, with each multiplied by a
coefficient, and similarly for $\mathbf{z}_2\boldsymbol{\beta}_2$. (Some authors use the notation $\mathbf{z}_1'\boldsymbol{\beta}_1$ and $\mathbf{z}_2'\boldsymbol{\beta}_2$ instead. If you have an
interest in the matrix algebra approach to econometrics, see Advanced Treatment E.)

The fact that $\mathbf{z}_1$ and $\mathbf{z}_2$ generally contain different exogenous variables means that we have imposed
exclusion restrictions on the model. In other words, we assume that certain exogenous variables do
not appear in the first equation and others are absent from the second equation. As we saw with the
previous supply and demand examples, this allows us to distinguish between the two structural equations.

When can we solve equations (16.17) and (16.18) for $y_1$ and $y_2$ (as linear functions of all
exogenous variables and the structural errors, $u_1$ and $u_2$)? The condition is the same as that in (16.13),
namely, $\alpha_2\alpha_1 \neq 1$. The proof is virtually identical to the simple model in Section 16-2. Under this
assumption, reduced forms exist for $y_1$ and $y_2$.

The key question is: Under what assumptions can we estimate the parameters in, say, (16.17)? This 
is the identification issue. The rank condition for identification of equation (16.17) is easy to state. 


Rank Condition for Identification of a Structural Equation. The first equation 
in a two-equation simultaneous equations model is identified if, and only if, the second equation contains 
at least one exogenous variable (with a nonzero coefficient) that is excluded from the first equation. 

This is the necessary and sufficient condition for equation (16.17) to be identified. The order 
condition, which we discussed in Chapter 15, is necessary for the rank condition. The order condi- 
tion for identifying the first equation states that at least one exogenous variable is excluded from this 
equation. The order condition is trivial to check once both equations have been specified. The rank 
condition requires more: at least one of the exogenous variables excluded from the first equation 
must have a nonzero population coefficient in the second equation. This ensures that at least one of 
the exogenous variables omitted from the first equation actually appears in the reduced form of $y_2$, so
that we can use these variables as instruments for $y_2$. We can test this using a $t$ or an $F$ test, as in
Chapter 15; some examples follow.

Identification of the second equation is, naturally, just the mirror image of the statement for
the first equation. Also, if we write the equations as in the labor supply and demand example in
Section 16-1 (so that $y_1$ appears on the left-hand side in both equations, with $y_2$ on the right-hand
side), the identification condition is identical.


EXAMPLE 16.3 Labor Supply of Married, Working Women


To illustrate the identification issue, consider labor supply for married women already in the work- 
force. In place of the demand function, we write the wage offer as a function of hours and the usual 
productivity variables. With the equilibrium condition imposed, the two structural equations are 


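$\mathit{hours} = \alpha_1\log(\mathit{wage}) + \beta_{10} + \beta_{11}\mathit{educ} + \beta_{12}\mathit{age} + \beta_{13}\mathit{kidslt6} + \beta_{14}\mathit{nwifeinc} + u_1$ [16.19]


and


$\log(\mathit{wage}) = \alpha_2\mathit{hours} + \beta_{20} + \beta_{21}\mathit{educ} + \beta_{22}\mathit{exper} + \beta_{23}\mathit{exper}^2 + u_2.$ [16.20]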


The variable age is the woman’s age, in years, kidslt6 is the number of children less than six years old,
nwifeinc is the woman’s nonwage income (which includes husband’s earnings), and educ and exper
are years of education and prior experience, respectively. All variables except hours and log(wage)
are assumed to be exogenous. (This is a tenuous assumption, as educ might be correlated with
omitted ability in either equation. But for illustration purposes, we ignore the omitted ability problem.)
The functional form in this system, where hours appears in level form but wage is in logarithmic
form, is popular in labor economics. We can write this system as in equations (16.17) and (16.18) by
defining $y_1 = \mathit{hours}$ and $y_2 = \log(\mathit{wage})$.

The first equation is the supply function. It satisfies the order condition because two exogenous
variables, exper and exper$^2$, are omitted from the labor supply equation. These exclusion restrictions
are crucial assumptions: we are assuming that, once wage, education, age, number of small children,
and other income are controlled for, past experience has no effect on current labor supply. One could
certainly question this assumption, but we use it for illustration.

Given equations (16.19) and (16.20), the rank condition for identifying the first equation is that
at least one of exper and exper$^2$ has a nonzero coefficient in equation (16.20). If $\beta_{22} = 0$ and $\beta_{23} = 0$,
there are no exogenous variables appearing in the second equation that do not also appear in the first
(educ appears in both). We can state the rank condition for identification of (16.19) equivalently in
terms of the reduced form for log(wage), which is

$\log(\mathit{wage}) = \pi_{20} + \pi_{21}\mathit{educ} + \pi_{22}\mathit{age} + \pi_{23}\mathit{kidslt6} + \pi_{24}\mathit{nwifeinc} + \pi_{25}\mathit{exper} + \pi_{26}\mathit{exper}^2 + v_2.$ [16.21]


For identification, we need $\pi_{25} \neq 0$ or $\pi_{26} \neq 0$, something we can test using a standard $F$ statistic, as
we discussed in Chapter 15.

The wage offer equation, (16.20), is identified if at least one of age, kidslt6, or nwifeinc has a non- 
zero coefficient in (16.19). This is identical to assuming that the reduced form for hours—which has 
the same form as the right-hand side of (16.21)—depends on at least one of age, kidslt6, or nwifeinc. 
In specifying the wage offer equation, we are assuming that age, kidslt6, and nwifeinc have no effect 
on the offered wage, once hours, education, and experience are accounted for. These would be poor 
assumptions if these variables somehow have direct effects on productivity, or if women are discrimi- 
nated against based on their age or number of small children. 

In Example 16.3, we take the population of interest to be married women who are in the workforce
(so that equilibrium hours are positive). This excludes the group of married women who choose not to 
work outside the home. Including such women in the model raises some difficult problems. For instance, 
if a woman does not work, we cannot observe her wage offer. We touch on these issues in Chapter 17; but 
for now, we must think of equations (16.19) and (16.20) as holding only for women who have hours > 0. 


EXAMPLE 16.4 Inflation and Openness 


Romer (1993) proposes theoretical models of inflation that imply that more “open” countries should 
have lower inflation rates. His empirical analysis explains average annual inflation rates (since 1973) 
in terms of the average share of imports in gross domestic (or national) product since 1973—which 
is his measure of openness. In addition to estimating the key equation by OLS, he uses instrumental 
variables. While Romer does not specify both equations in a simultaneous system, he has in mind a 
two-equation system: 


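$\mathit{inf} = \beta_{10} + \alpha_1\mathit{open} + \beta_{11}\log(\mathit{pcinc}) + u_1$ [16.22]


$\mathit{open} = \beta_{20} + \alpha_2\mathit{inf} + \beta_{21}\log(\mathit{pcinc}) + \beta_{22}\log(\mathit{land}) + u_2,$ [16.23]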


where pcinc is 1980 per capita income, in U.S. dollars (assumed to be exogenous), and land is the 
land area of the country, in square miles (also assumed to be exogenous). Equation (16.22) is the 
one of interest, with the hypothesis that $\alpha_1 < 0$. (More open economies have lower inflation rates.)
The second equation reflects the fact that the degree of openness might depend on the average
inflation rate, as well as other factors. The variable log(pcinc) appears in both equations, but log(land) is
assumed to appear only in the second equation. The idea is that, ceteris paribus, a smaller country is
likely to be more open (so $\beta_{22} < 0$).

Using the identification rule that was stated earlier, equation (16.22) is identified, provided
$\beta_{22} \neq 0$. Equation (16.23) is not identified because it contains both exogenous variables. But we are
interested in (16.22).

GOING FURTHER 16.2
If we have money supply growth since 1973 for each country, which we assume is exogenous, does this
help identify equation (16.23)?


16-3b Estimation by 2SLS 


Once we have determined that an equation is identified, we can estimate it by two stage least squares. 
The instrumental variables consist of the exogenous variables appearing in either equation. 
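To make the two stages concrete, here is a minimal sketch of 2SLS for the first equation of a two-equation system like (16.17) and (16.18), with one exogenous variable in each equation. It uses simulated data, not any data set from the text; the parameter values and the helper names (`solve`, `ols`, `two_sls`) are assumptions. The second-stage OLS standard errors would not be valid, so only point estimates are computed.

```python
import random

def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def ols(y, cols):
    """OLS coefficients of y on the given columns (pass a column of ones for an intercept)."""
    k = len(cols)
    xtx = [[sum(a * b for a, b in zip(cols[i], cols[j])) for j in range(k)] for i in range(k)]
    xty = [sum(a * b for a, b in zip(cols[i], y)) for i in range(k)]
    return solve(xtx, xty)

def two_sls(y1, y2, included, excluded):
    """2SLS of y1 on y2 and the included exogenous columns.

    Instruments are all exogenous columns in the system: included + excluded.
    Returns [coef on y2] + [coefs on included columns].
    """
    instruments = included + excluded
    g = ols(y2, instruments)                                   # first stage
    y2_hat = [sum(c * col[i] for c, col in zip(g, instruments)) for i in range(len(y2))]
    return ols(y1, [y2_hat] + included)                        # second stage

# Hypothetical structural parameters for the simulation.
A1, A2, B10, B11, B20, B21 = -0.5, 0.8, 1.0, 1.0, 2.0, 1.0
rng = random.Random(3)
n = 20000
d = 1.0 - A2 * A1
ones, z1, z2, y1, y2 = [1.0] * n, [], [], [], []
for _ in range(n):
    e1, e2, u1, u2 = (rng.gauss(0.0, 1.0) for _ in range(4))
    v2 = (A2 * (B10 + B11 * e1 + u1) + B20 + B21 * e2 + u2) / d   # reduced form for y2
    z1.append(e1)
    z2.append(e2)
    y2.append(v2)
    y1.append(B10 + A1 * v2 + B11 * e1 + u1)                      # first structural equation

a1_2sls, b10_2sls, b11_2sls = two_sls(y1, y2, included=[ones, z1], excluded=[z2])
a1_ols = ols(y1, [y2, ones, z1])[0]
print("true alpha1:", A1, " 2SLS:", round(a1_2sls, 3), " OLS:", round(a1_ols, 3))
```

The excluded variable `z2` plays the role of the exogenous variable that appears only in the second equation; without it the first stage would add no information beyond the included regressors.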


EXAMPLE 16.5 Labor Supply of Married, Working Women


We use the data on working, married women in MROZ to estimate the labor supply equation (16.19)
by 2SLS. The full set of instruments includes educ, age, kidslt6, nwifeinc, exper, and exper$^2$. The
estimated labor supply curve is


$\widehat{\mathit{hours}} = \underset{(574.56)}{2{,}225.66} + \underset{(470.58)}{1{,}639.56}\,\log(\mathit{wage}) - \underset{(59.10)}{183.75}\,\mathit{educ} - \underset{(9.38)}{7.81}\,\mathit{age} - \underset{(182.93)}{198.15}\,\mathit{kidslt6} - \underset{(6.61)}{10.17}\,\mathit{nwifeinc}$ [16.24]

$n = 428,$

where the reported standard errors are computed using a degrees-of-freedom adjustment. This
equation shows that the labor supply curve slopes upward. The estimated coefficient on log(wage) has
the following interpretation: holding other factors fixed, $\Delta\mathit{hours} \approx 16.4(\%\Delta\mathit{wage})$. We can calculate
labor supply elasticities by multiplying both sides of this last equation by 100/hours:


$100\cdot(\Delta\mathit{hours}/\mathit{hours}) \approx (1{,}640/\mathit{hours})(\%\Delta\mathit{wage})$
or
$\%\Delta\mathit{hours} \approx (1{,}640/\mathit{hours})(\%\Delta\mathit{wage}),$


which implies that the labor supply elasticity (with respect to wage) is simply 1,640/hours. [The
elasticity is not constant in this model because hours, not log(hours), is the dependent variable in
(16.24).] At the average hours worked, 1,303, the estimated elasticity is $1{,}640/1{,}303 \approx 1.26$, which
implies a greater than 1% increase in hours worked given a 1% increase in wage. This is a large
estimated elasticity. At higher hours, the elasticity will be smaller; at lower hours, such as hours = 800,
the elasticity is over two.
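The elasticity arithmetic is easy to reproduce. The short sketch below evaluates the coefficient on log(wage) from (16.24) divided by hours at a few hours levels; the function name and the extra value of 2,000 hours are illustrative choices, not from the text.

```python
def labor_supply_elasticity(hours, coef_log_wage=1639.56):
    """Elasticity of hours with respect to wage in a level-log equation: coefficient / hours."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    return coef_log_wage / hours

# 1,303 is the sample average of hours reported in the text; 800 and 2,000 are for comparison.
for h in (800, 1303, 2000):
    print(h, "hours -> elasticity of about", round(labor_supply_elasticity(h), 2))
```

Because hours enters the equation in levels, the implied elasticity falls as hours rise, which is why it is above two at 800 hours and below one at 2,000.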

For comparison, when (16.19) is estimated by OLS, the coefficient on log(wage) is −2.05
(se = 54.88), which implies no wage effect on hours worked. To confirm that log(wage) is in fact
endogenous in (16.19), we can carry out the test from Section 15-5. When we add the reduced
form residuals $\hat{v}_2$ to the equation and estimate by OLS, the $t$ statistic on $\hat{v}_2$ is −6.61, which is very
significant, and so log(wage) appears to be endogenous.

The wage offer equation (16.20) can also be estimated by 2SLS. The result is 


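$\widehat{\log(\mathit{wage})} = \underset{(.338)}{-.656} + \underset{(.00025)}{.00013}\,\mathit{hours} + \underset{(.016)}{.110}\,\mathit{educ} + \underset{(.019)}{.035}\,\mathit{exper} - \underset{(.00045)}{.00071}\,\mathit{exper}^2$ [16.25]
$n = 428.$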


This differs from previous wage equations in that hours is included as an explanatory variable and 
2SLS is used to account for endogeneity of hours (and we assume that educ and exper are exog- 
enous). The coefficient on hours is statistically insignificant, which means that there is no evidence 
that the wage offer increases with hours worked. The other coefficients are similar to what we get by 
dropping hours and estimating the equation by OLS. 


Estimating the effect of openness on inflation by instrumental variables is also straightforward. 


EXAMPLE 16.6 Inflation and Openness


Before we estimate (16.22) using the data in OPENNESS, we check to see whether open has
sufficient partial correlation with the proposed IV, log(land). The reduced form regression is
$\widehat{\mathit{open}} = \underset{(15.85)}{117.08} + \underset{(1.493)}{.546}\,\log(\mathit{pcinc}) - \underset{(.81)}{7.57}\,\log(\mathit{land})$
$n = 114$, $R^2 = .449$.


The $t$ statistic on log(land) is over nine in absolute value, which verifies Romer’s assertion that
smaller countries are more open. The fact that log(pcinc) is so insignificant in this regression is
irrelevant.

Estimating (16.22) using log(land) as an IV for open gives
$\widehat{\mathit{inf}} = \underset{(15.40)}{26.90} - \underset{(.144)}{.337}\,\mathit{open} + \underset{(2.015)}{.376}\,\log(\mathit{pcinc})$ [16.26]
$n = 114.$


The coefficient on open is statistically significant at about the 1% level against a one-sided
alternative ($\alpha_1 < 0$). The effect is economically important as well: for every percentage point
increase in the import share of GDP, annual inflation is about one-third of a percentage point lower.
For comparison, the OLS estimate is −.215 (se = .095).

GOING FURTHER 16.3
How would you test whether the OLS and IV estimates on open are statistically different?
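The significance statement for the coefficient on open in (16.26) can be checked from the reported estimate and standard error. The sketch below uses a standard normal approximation to the $t$ distribution, which is an assumption made to keep the example dependency-free; with 114 observations the exact $t$-based p-value would be slightly larger. The function names are hypothetical.

```python
import math

def normal_cdf(x):
    """Standard normal CDF computed from the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def one_sided_lower_test(coef, se):
    """Return the t ratio and the normal-approximation p-value for H1: coefficient < 0."""
    t = coef / se
    return t, normal_cdf(t)

t_iv, p_iv = one_sided_lower_test(-0.337, 0.144)     # IV estimate from (16.26)
t_ols, p_ols = one_sided_lower_test(-0.215, 0.095)   # OLS estimate quoted in the text
print("IV : t =", round(t_iv, 2), " one-sided p =", round(p_iv, 4))
print("OLS: t =", round(t_ols, 2), " one-sided p =", round(p_ols, 4))
```

Both $t$ ratios are a little larger than two in absolute value, and the IV p-value comes out near 1%, consistent with the statement in the example.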


16-4 Systems with More Than Two Equations 


Simultaneous equations models can consist of more than two equations. Studying general identifica- 
tion of these models is difficult and requires matrix algebra. Once an equation in a general system has 
been shown to be identified, it can be estimated by 2SLS. 


16-4a Identification in Systems with Three or More Equations 


We will use a three-equation system to illustrate the issues that arise in the identification of compli- 
cated SEMs. With intercepts suppressed, write the model as 


$y_1 = \alpha_{12} y_2 + \alpha_{13} y_3 + \beta_{11} z_1 + u_1$ [16.27]
$y_2 = \alpha_{21} y_1 + \beta_{21} z_1 + \beta_{22} z_2 + \beta_{23} z_3 + u_2$ [16.28]
$y_3 = \alpha_{32} y_2 + \beta_{31} z_1 + \beta_{32} z_2 + \beta_{33} z_3 + \beta_{34} z_4 + u_3,$ [16.29]


where the $y_g$ are the endogenous variables and the $z_j$ are exogenous. The first subscript on the
parameters indicates the equation number, and the second indicates the variable number; we use $\alpha$ for
parameters on endogenous variables and $\beta$ for parameters on exogenous variables.

Which of these equations can be estimated? It is generally difficult to show that an equation in
an SEM with more than two equations is identified, but it is easy to see when certain equations are
not identified. In system (16.27) through (16.29), we can easily see that (16.29) falls into this
category. Because every exogenous variable appears in this equation, we have no IVs for $y_2$. Therefore,
we cannot consistently estimate the parameters of this equation. For the reasons we discussed in
Section 16-2, OLS estimation will not usually be consistent.

What about equation (16.27)? Things look promising because $z_2$, $z_3$, and $z_4$ are all excluded from
the equation, which is another example of exclusion restrictions. Although there are two endogenous
variables in this equation, we have three potential IVs for $y_2$ and $y_3$. Therefore, equation (16.27) passes
the order condition. For completeness, we state the order condition for general SEMs.


Order Condition for Identification. An equation in any SEM satisfies the order condition 
for identification if the number of excluded exogenous variables from the equation is at least as large 
as the number of right-hand side endogenous variables. 

The second equation, (16.28), also passes the order condition because there is one excluded
exogenous variable, $z_4$, and one right-hand side endogenous variable, $y_1$.

As we discussed in Chapter 15 and in the previous section, the order condition is only necessary,
not sufficient, for identification. For example, if $\beta_{34} = 0$, $z_4$ appears nowhere in the system, which
means it is not correlated with $y_1$, $y_2$, or $y_3$. If $\beta_{34} = 0$, then the second equation is not identified,
because $z_4$ is useless as an IV for $y_1$. This again illustrates that identification of an equation depends
on the values of the parameters (which we can never know for sure) in the other equations.

There are many subtle ways that identification can fail in complicated SEMs. To obtain sufficient
conditions, we need to extend the rank condition for identification in two-equation systems. This is
possible, but it requires matrix algebra [see, for example, Wooldridge (2010, Chapter 9)]. In many
applications, one assumes that, unless there is an obvious failure of identification, an equation that
satisfies the order condition is identified.

The nomenclature on overidentified and just identified equations from Chapter 15 originated
with SEMs. In terms of the order condition, (16.27) is an overidentified equation because we need
only two IVs (for $y_2$ and $y_3$) but we have three available ($z_2$, $z_3$, and $z_4$); there is one
overidentifying restriction in this equation. In general, the number of overidentifying restrictions equals the total
number of exogenous variables in the system minus the total number of explanatory variables in the
equation. These can be tested using the overidentification test from Section 15-5. Equation (16.28) is
a just identified equation, and the third equation is an unidentified equation.
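The counting behind the order condition can be written out mechanically. The sketch below encodes system (16.27) through (16.29) as a dictionary and classifies each equation; the data layout and the function name `order_condition` are illustrative assumptions. It checks only the order condition, which is necessary but not sufficient, so it says nothing about the rank condition.

```python
def order_condition(system, equation):
    """Classify one equation by the order condition.

    Returns (status, surplus, excluded), where surplus is the number of excluded
    exogenous variables minus the number of right-hand side endogenous variables.
    """
    all_exog = set()
    for spec in system.values():
        all_exog |= set(spec["exog"])
    spec = system[equation]
    excluded = all_exog - set(spec["exog"])
    surplus = len(excluded) - len(spec["endog_rhs"])
    if surplus < 0:
        status = "fails the order condition (unidentified)"
    elif surplus == 0:
        status = "just identified, provided the rank condition holds"
    else:
        status = "overidentified, provided the rank condition holds"
    return status, surplus, sorted(excluded)

# Right-hand side variables of (16.27)-(16.29), with intercepts suppressed as in the text.
SYSTEM = {
    "16.27": {"endog_rhs": ["y2", "y3"], "exog": ["z1"]},
    "16.28": {"endog_rhs": ["y1"], "exog": ["z1", "z2", "z3"]},
    "16.29": {"endog_rhs": ["y2"], "exog": ["z1", "z2", "z3", "z4"]},
}

for eq in SYSTEM:
    status, surplus, excluded = order_condition(SYSTEM, eq)
    print(eq, "| excluded exogenous:", excluded, "| surplus:", surplus, "|", status)
```

For an equation that passes, the surplus equals the number of overidentifying restrictions, matching the count of total exogenous variables minus explanatory variables in the equation.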


16-4b Estimation 


Regardless of the number of equations in an SEM, each identified equation can be estimated by 2SLS. 
The instruments for a particular equation consist of the exogenous variables appearing anywhere in 
the system. Tests for endogeneity, heteroskedasticity, serial correlation, and overidentifying restric- 
tions can be obtained, just as in Chapter 15. 

It turns out that, when any system with two or more equations is correctly specified and certain 
additional assumptions hold, system estimation methods are generally more efficient than estimat- 
ing each equation by 2SLS. The most common system estimation method in the context of SEMs 
is three stage least squares. These methods, with or without endogenous explanatory variables, are 
beyond the scope of this text. [See, for example, Wooldridge (2010, Chapters 7 and 8).] 


16-5 Simultaneous Equations Models with Time Series 


Among the earliest applications of SEMs was estimation of large systems of simultaneous equations 
that were used to describe a country’s economy. A simple Keynesian model of aggregate demand (that 
ignores exports and imports) is 


$C_t = \beta_0 + \beta_1(Y_t - T_t) + \beta_2 r_t + u_{t1}$ [16.30]
$I_t = \gamma_0 + \gamma_1 r_t + u_{t2}$ [16.31]
$Y_t \equiv C_t + I_t + G_t,$ [16.32]


where
$C_t$ = consumption,
$Y_t$ = income,
$T_t$ = tax receipts,
$r_t$ = the interest rate,
$I_t$ = investment, and
$G_t$ = government spending.


[See, for example, Mankiw (1994, Chapter 9).] For concreteness, assume t represents year. 

The first equation is an aggregate consumption function, where consumption depends on
disposable income, the interest rate, and the unobserved structural error $u_{t1}$. The second equation is a very
simple investment function. Equation (16.32) is an identity that is a result of national income
accounting: it holds by definition, without error. Thus, there is no sense in which we estimate (16.32), but we
need this equation to round out the model.

Because there are three equations in the system, there must also be three endogenous variables.
Given the first two equations, it is clear that we intend for $C_t$ and $I_t$ to be endogenous. In addition,
because of the accounting identity, $Y_t$ is endogenous. We would assume, at least in this model, that $T_t$,
$r_t$, and $G_t$ are exogenous, so that they are uncorrelated with $u_{t1}$ and $u_{t2}$. (We will discuss problems with
this kind of assumption later.)

If $r_t$ is exogenous, then OLS estimation of equation (16.31) is natural. The consumption function,
however, depends on disposable income, which is endogenous because $Y_t$ is. We have two instruments
available under the maintained exogeneity assumptions: $T_t$ and $G_t$. Therefore, if we follow our prescription
for estimating cross-sectional equations, we would estimate (16.30) by 2SLS using instruments $(T_t, G_t, r_t)$.

Models such as (16.30) through (16.32) are seldom estimated now, for several good reasons.
First, it is very difficult to justify, at an aggregate level, the assumption that taxes, interest rates, and
government spending are exogenous. Taxes clearly depend directly on income; for example, with a
single marginal income tax rate $\tau_t$ in year $t$, $T_t = \tau_t Y_t$. We can easily allow this by replacing $(Y_t - T_t)$
with $(1 - \tau_t)Y_t$ in (16.30), and we can still estimate the equation by 2SLS if we assume that
government spending is exogenous. We could also add the tax rate to the instrument list, if it is exogenous.
But are government spending and tax rates really exogenous? They certainly could be in principle,
if the government sets spending and tax rates independently of what is happening in the economy.
But it is a difficult case to make in reality: government spending generally depends on the level of
income, and at high levels of income, the same tax receipts are collected for lower marginal tax rates.
In addition, assuming that interest rates are exogenous is extremely questionable. We could specify a
more realistic model that includes money demand and supply, and then interest rates could be jointly
determined with $C_t$, $I_t$, and $Y_t$. But then finding enough exogenous variables to identify the equations
becomes quite difficult (and the following problems with these models still pertain).

Some have argued that certain components of government spending, such as defense spending— 
see, for example, Hall (1988) and Ramey (1991)—are exogenous in a variety of simultaneous equa- 
tions applications. But this is not universally agreed upon, and, in any case, defense spending is not 
always appropriately correlated with the endogenous explanatory variables [see Shea (1993) for dis- 
cussion and Computer Exercises C6 for an example]. 

A second problem with a model such as (16.30) through (16.32) is that it is completely static. 
Especially with monthly or quarterly data, but even with annual data, we often expect adjustment 
lags. (One argument in favor of static Keynesian-type models is that they are intended to describe the 
long run without worrying about short-run dynamics.) Allowing dynamics is not very difficult. For 
example, we could add lagged income to equation (16.31): 


$I_t = \gamma_0 + \gamma_1 r_t + \gamma_2 Y_{t-1} + u_{t2}.$ [16.33]


In other words, we add a lagged endogenous variable (but not $I_{t-1}$) to the investment equation. Can
we treat $Y_{t-1}$ as exogenous in this equation? Under certain assumptions on $u_{t2}$, the answer is yes. But
we typically call a lagged endogenous variable in an SEM a predetermined variable. Lags of
exogenous variables are also predetermined. If we assume that $u_{t2}$ is uncorrelated with current exogenous
variables (which is standard) and all past endogenous and exogenous variables, then $Y_{t-1}$ is
uncorrelated with $u_{t2}$. Given exogeneity of $r_t$, we can estimate (16.33) by OLS.

If we add lagged consumption to (16.30), we can treat $C_{t-1}$ as exogenous in this equation under
the same assumptions on $u_{t1}$ that we made for $u_{t2}$ in the previous paragraph. Current disposable
income is still endogenous in


$C_t = \beta_0 + \beta_1(Y_t - T_t) + \beta_2 r_t + \beta_3 C_{t-1} + u_{t1},$ [16.34]

so we could estimate this equation by 2SLS using instruments $(T_t, G_t, r_t, C_{t-1})$; if investment is
determined by (16.33), $Y_{t-1}$ should be added to the instrument list. [To see why, use (16.32), (16.33),
and (16.34) to find the reduced form for $Y_t$ in terms of the exogenous and predetermined variables: $T_t$,
$r_t$, $G_t$, $C_{t-1}$, and $Y_{t-1}$. Because $Y_{t-1}$ shows up in this reduced form, it should be used as an IV.]
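The reduced form for $Y_t$ mentioned in the bracketed remark can be worked out in a few lines. The derivation below is a sketch that substitutes (16.33) and (16.34) into the identity (16.32); it assumes $\beta_1 \neq 1$, a condition not stated in the text but needed to solve for $Y_t$ (it plays the same role as (16.13)).

```latex
\begin{align*}
Y_t &= C_t + I_t + G_t \\
    &= \beta_0 + \beta_1 (Y_t - T_t) + \beta_2 r_t + \beta_3 C_{t-1} + u_{t1}
       + \gamma_0 + \gamma_1 r_t + \gamma_2 Y_{t-1} + u_{t2} + G_t, \\
(1 - \beta_1)\, Y_t &= (\beta_0 + \gamma_0) - \beta_1 T_t + (\beta_2 + \gamma_1)\, r_t + G_t
       + \beta_3 C_{t-1} + \gamma_2 Y_{t-1} + (u_{t1} + u_{t2}), \\
Y_t &= \frac{(\beta_0 + \gamma_0) - \beta_1 T_t + (\beta_2 + \gamma_1)\, r_t + G_t
       + \beta_3 C_{t-1} + \gamma_2 Y_{t-1} + (u_{t1} + u_{t2})}{1 - \beta_1}.
\end{align*}
```

The coefficient on $Y_{t-1}$ is $\gamma_2/(1 - \beta_1)$, which is nonzero whenever $\gamma_2 \neq 0$, so $Y_{t-1}$ helps predict $Y_t$ and belongs in the instrument list.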

The presence of dynamics in aggregate SEMs is, at least for the purposes of forecasting, a clear 
improvement over static SEMs. But there are still some important problems with estimating SEMs using 
aggregate time series data, some of which we discussed in Chapters 11 and 15. Recall that the validity 
of the usual OLS or 2SLS inference procedures in time series applications hinges on the notion of weak 
dependence. Unfortunately, series such as aggregate consumption, income, investment, and even interest 
rates seem to violate the weak dependence requirements. (In the terminology of Chapter 11, they have 
unit roots.) These series also tend to have exponential trends, although this can be partly overcome by 
using the logarithmic transformation and assuming different functional forms. Generally, even the large 
sample, let alone the small sample, properties of OLS and 2SLS are complicated and dependent on vari- 
ous assumptions when they are applied to equations with I(1) variables. We will briefly touch on these 
issues in Chapter 18. An advanced, general treatment is given by Hamilton (1994). 

Does the previous discussion mean that SEMs are not usefully applied to time series data? Not 
at all. The problems with trends and high persistence can be avoided by specifying systems in first 
differences or growth rates. But one should recognize that this is a different SEM than one specified 
in levels. [For example, if we specify consumption growth as a function of disposable income growth 
and interest rate changes, this is different from (16.30).] Also, as we discussed earlier, incorporat- 
ing dynamics is not especially difficult. Finally, the problem of finding truly exogenous variables to 
include in SEMs is often easier with disaggregated data. For example, for manufacturing industries, 
Shea (1993) describes how output (or, more precisely, growth in output) in other industries can be 
used as an instrument in estimating supply functions. Ramey (1991) also has a convincing analysis of 
estimating industry cost functions by instrumental variables using time series data. 

The next example shows how aggregate data can be used to test an important economic theory, the 
permanent income theory of consumption, usually called the permanent income hypothesis (PIH). The 
approach used in this example is not, strictly speaking, based on a simultaneous equations model, but 
we can think of consumption and income growth (as well as interest rates) as being jointly determined. 


EXAMPLE 16.7 Testing the Permanent Income Hypothesis


Campbell and Mankiw (1990) used instrumental variables methods to test various versions of the
PIH. We will use the annual data from 1959 through 1995 in CONSUMP to mimic one of their analyses.
Campbell and Mankiw used quarterly data running through 1985.

One equation estimated by Campbell and Mankiw (using our notation) is

$$gc_t = \beta_0 + \beta_1 gy_t + \beta_2 r3_t + u_t, \tag{16.35}$$

where

$gc_t = \Delta\log(c_t)$ = annual growth in real per capita consumption (excluding durables),

$gy_t$ = growth in real disposable income, and

$r3_t$ = the (ex post) real interest rate as measured by the return on three-month T-bill
rates: $r3_t = i3_t - inf_t$, where the inflation rate is based on the Consumer Price Index.
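
As a small illustration of how the variables in (16.35) are constructed, the following Python sketch
computes log-difference growth rates and an ex post real interest rate. The function names and the
numbers are hypothetical and are not taken from CONSUMP; only the definitions $gc_t = \Delta\log(c_t)$
and $r3_t = i3_t - inf_t$ come from the text.

```python
import math

def log_growth(levels):
    """Return the log differences log(x_t) - log(x_{t-1}) of a series of positive levels."""
    return [math.log(curr) - math.log(prev) for prev, curr in zip(levels, levels[1:])]

def ex_post_real_rate(nominal_rates, inflation_rates):
    """Return r3_t = i3_t - inf_t, period by period."""
    return [i3_t - inf_t for i3_t, inf_t in zip(nominal_rates, inflation_rates)]

# Hypothetical (made-up) series, for illustration only.
consumption = [100.0, 102.0, 105.06]   # real per capita consumption in three years
i3 = [5.0, 6.5]                        # three-month T-bill rate, in percent
inf = [3.0, 4.0]                       # CPI inflation rate, in percent

gc = log_growth(consumption)           # one observation is lost to differencing
r3 = ex_post_real_rate(i3, inf)
```

Note that differencing costs one observation, which is one reason the estimation sample in an
application is shorter than the raw data.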


The growth rates of consumption and disposable income are not trending, and they are weakly
dependent; we will assume this is the case for $r3_t$ as well, so that we can apply standard asymptotic
theory.

The key feature of equation (16.35) is that the PIH implies that the error term $u_t$ has a zero mean
conditional on all information observed at time $t - 1$ or earlier: $E(u_t \mid I_{t-1}) = 0$. However, $u_t$ is not
necessarily uncorrelated with $gy_t$ or $r3_t$; a traditional way to think about this is that these variables are
jointly determined, but we are not writing down a full three-equation system.

Because $u_t$ is uncorrelated with all variables dated $t - 1$ or earlier, valid instruments for estimating
(16.35) are lagged values of gc, gy, and r3 (and lags of other observable variables, but we will
not use those here). What are the hypotheses of interest? The pure form of the PIH has $\beta_1 = \beta_2 = 0$.
Campbell and Mankiw argue that $\beta_1$ is positive if some fraction of the population consumes current
income, rather than permanent income. The PIH with a nonconstant real interest rate implies that
$\beta_2 > 0$.

When we estimate (16.35) by 2SLS, using instruments $gc_{t-1}$, $gy_{t-1}$, and $r3_{t-1}$ for the endogenous
variables $gy_t$ and $r3_t$, we obtain

$$
\begin{aligned}
\widehat{gc}_t &= \underset{(.0032)}{.0081} + \underset{(.135)}{.586}\, gy_t - \underset{(.00076)}{.00027}\, r3_t \\
n &= 35, \quad R^2 = .678.
\end{aligned}
\tag{16.36}
$$


Therefore, the pure form of the PIH is strongly rejected because the coefficient on gy is economically 
large (a 1% increase in disposable income increases consumption by over .5%) and statistically 
significant (t = 4.34). By contrast, the real interest rate coefficient is very small and statistically 
insignificant. These findings are qualitatively the same as Campbell and Mankiw’s. 
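
The t statistics behind these statements follow directly from the estimates and standard errors in
(16.36). A quick check in Python is sketched below; the variable names are ours, and the 1.96 cutoff is
the usual large-sample 5% two-sided critical value, an assumption rather than something stated in the
example.

```python
# Coefficients and standard errors as reported in equation (16.36).
estimates = {"gy": (0.586, 0.135), "r3": (-0.00027, 0.00076)}

# The t statistic for H0: coefficient = 0 is the estimate divided by its standard error.
t_stats = {name: coef / se for name, (coef, se) in estimates.items()}

# Assumed rule of thumb: |t| > 1.96 rejects H0 at the 5% level (two-sided, large sample).
significant = {name: abs(t) > 1.96 for name, t in t_stats.items()}

for name in estimates:
    print(f"{name}: t = {t_stats[name]:.2f}, significant at 5%: {significant[name]}")
```

The ratio for gy reproduces the t = 4.34 quoted in the text, while the ratio for r3 is well below one in
absolute value.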

The PIH also implies that the errors $\{u_t\}$ are serially uncorrelated. After 2SLS estimation, we
obtain the residuals, $\hat{u}_t$, and include $\hat{u}_{t-1}$ as an additional explanatory variable in (16.36); we still use
instruments $gc_{t-1}$, $gy_{t-1}$, $r3_{t-1}$, and $\hat{u}_{t-1}$ acts as its own instrument (see Section 15-7). The coefficient
on $\hat{u}_{t-1}$ is $\hat{\rho} = .187$ (se = .133), so there is some evidence of positive serial correlation, although not
at the 5% significance level. Campbell and Mankiw discuss why, with the available quarterly data,
positive serial correlation might be found in the errors even if the PIH holds; some of those concerns
carry over to annual data.


Using growth rates of trending or I(1) variables 
in SEMs is fairly common in time series applications. 
For example, Shea (1993) estimates industry supply 
curves specified in terms of growth rates. 

If a structural model contains a time trend— 
which may capture exogenous, trending factors that 
are not directly modeled—then the trend acts as its 
own IV. 


GOING FURTHER 16.4


Suppose that for a particular city you have 
monthly data on per capita consumption 
of fish, per capita income, the price of fish, 
and the prices of chicken and beef; income 
and chicken and beef prices are exogenous. 
Assume that there is no seasonality in the 
demand function for fish, but there is in the 
supply of fish. How can you use this infor- 
mation to estimate a constant elasticity 
demand-for-fish equation? Specify an equa- 
tion and discuss identification. (Hint: You 
should have 11 instrumental variables for 
the price of fish.) 


16-6 Simultaneous Equations Models with Panel Data 


Simultaneous equations models also arise in panel data contexts. For example, we can imagine esti- 
mating labor supply and wage offer equations, as in Example 16.3, for a group of people working 
over a given period of time. In addition to allowing for simultaneous determination of variables within 
each time period, we can allow for unobserved effects in each equation. In a labor supply function, it 
would be useful to allow an unobserved taste for leisure that does not change over time.

The basic approach to estimating SEMs with panel data involves two steps: (1) eliminate the
unobserved effects from the equations of interest using the fixed effects transformation or first differ- 
encing and (2) find instrumental variables for the endogenous variables in the transformed equation. 
This can be very challenging because, for a convincing analysis, we need to find instruments that 
change over time. To see why, write an SEM for panel data as 


$$y_{it1} = \alpha_1 y_{it2} + z_{it1}\beta_1 + a_{i1} + u_{it1} \tag{16.37}$$
$$y_{it2} = \alpha_2 y_{it1} + z_{it2}\beta_2 + a_{i2} + u_{it2}, \tag{16.38}$$


where i denotes cross section, t denotes time period, and $z_{it1}\beta_1$ or $z_{it2}\beta_2$ denotes linear functions of a
set of exogenous explanatory variables in each equation. The most general analysis allows the unobserved
effects, $a_{i1}$ and $a_{i2}$, to be correlated with all explanatory variables, even the elements in z.
However, we assume that the idiosyncratic structural errors, $u_{it1}$ and $u_{it2}$, are uncorrelated with the z
in both equations and across all time periods; this is the sense in which the z are exogenous. Except
under special circumstances, $y_{it2}$ is correlated with $u_{it1}$, and $y_{it1}$ is correlated with $u_{it2}$.

Suppose we are interested in equation (16.37). We cannot estimate it by OLS, as the composite
error $a_{i1} + u_{it1}$ is potentially correlated with all explanatory variables. Suppose we difference over
time to remove the unobserved effect, $a_{i1}$:


$$\Delta y_{it1} = \alpha_1 \Delta y_{it2} + \Delta z_{it1}\beta_1 + \Delta u_{it1}. \tag{16.39}$$


(As usual with differencing or time-demeaning, we can only estimate the effects of variables that
change over time for at least some cross-sectional units.) Now, the error term in this equation is
uncorrelated with $\Delta z_{it1}$ by assumption. But $\Delta y_{it2}$ and $\Delta u_{it1}$ are possibly correlated. Therefore, we need
an IV for $\Delta y_{it2}$.

As with the case of pure cross-sectional or pure time series data, possible IVs come from the
other equation: elements in $z_{it2}$ that are not also in $z_{it1}$. In practice, we need time-varying elements in
$z_{it2}$ that are not also in $z_{it1}$. This is because we need an instrument for $\Delta y_{it2}$, and a change in a variable
from one period to the next is unlikely to be highly correlated with the level of exogenous variables.
In fact, if we difference (16.38), we see that the natural IVs for $\Delta y_{it2}$ are those elements in $\Delta z_{it2}$ that
are not also in $\Delta z_{it1}$.
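
To make the two-step logic concrete, here is a minimal simulation sketch in Python. It is not the
procedure of any particular study: it assumes a stripped-down version of (16.37) and (16.38) in which
the first equation contains no exogenous variables, the second contains a single time-varying exogenous
variable z with coefficient delta, and all parameter values are made up. With one endogenous regressor
and one instrument, pooled 2SLS on the differenced equation reduces to the simple IV ratio used below.

```python
import random

def sample_cov(x, y):
    """Sample covariance (divisor n) between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)

def simulate_fd_iv(n_units=2000, n_periods=3, alpha1=0.5, alpha2=-0.5, delta=1.0, seed=0):
    """Simulate the panel SEM, first-difference it, and return (OLS, IV) estimates of alpha1."""
    rng = random.Random(seed)
    dy1, dy2, dz = [], [], []
    for _ in range(n_units):
        a1, a2 = rng.gauss(0, 1), rng.gauss(0, 1)    # unobserved effects a_i1 and a_i2
        prev = None
        for _ in range(n_periods):
            z = 0.5 * a1 + rng.gauss(0, 1)           # exogenous, but correlated with a_i1
            u1, u2 = rng.gauss(0, 1), rng.gauss(0, 1)
            # Solve the two structural equations for the endogenous variables.
            y2 = (alpha2 * (a1 + u1) + delta * z + a2 + u2) / (1 - alpha1 * alpha2)
            y1 = alpha1 * y2 + a1 + u1
            if prev is not None:                     # first differences remove a_i1 and a_i2
                dy1.append(y1 - prev[0])
                dy2.append(y2 - prev[1])
                dz.append(z - prev[2])
            prev = (y1, y2, z)
    ols = sample_cov(dy2, dy1) / sample_cov(dy2, dy2)   # pooled OLS on the differenced equation
    iv = sample_cov(dz, dy1) / sample_cov(dz, dy2)      # differenced z instruments differenced y2
    return ols, iv

ols_estimate, iv_estimate = simulate_fd_iv()
print(f"pooled OLS on differences: {ols_estimate:.3f}")
print(f"pooled IV on differences:  {iv_estimate:.3f}")
```

In this design the IV estimate should be close to the assumed value of 0.5, while pooled OLS on the
differenced data is pulled away from it by the correlation between the differenced endogenous variable
and the differenced error. The instrument works only because z changes over time within units.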

As an example of the problems that can arise, consider a panel data version of the labor supply 
function in Example 16.3. After differencing, suppose we have the equation 


$$\Delta hours_{it} = \beta_0 + \alpha_1 \Delta\log(wage_{it}) + \Delta(\text{other factors}_{it}),$$


and we wish to use $\Delta exper_{it}$ as an instrument for $\Delta\log(wage_{it})$. The problem is that, because we are
looking at people who work in every time period, $\Delta exper_{it} = 1$ for all i and t. (Each person gets another
year of experience after a year passes.) We cannot use an IV that takes the same value for all i and t,
and so we must look elsewhere. One possibility as an instrument for $\Delta\log(wage_{it})$ is the change in the
minimum wage at the state or local level. (As of January 2018, more than 40 localities in the United
States have minimum wages above the state minimum wage.) Naturally, in the labor supply function,
and, therefore, in the reduced form for $\Delta\log(wage_{it})$, one should include a full set of dummy variables
for the different time periods in order to render changes in the minimum wage exogenous to the
individual labor supply equation.

Often, participation in an experimental program can be used to obtain IVs in panel data contexts. 
In Example 15.10, we used receipt of job training grants as an IV for the change in hours of training 
in determining the effects of job training on worker productivity. In fact, we could view that in an 
SEM context: job training and worker productivity are jointly determined, but receiving a job training 
grant is exogenous in equation (15.57). 

One can sometimes come up with clever, convincing instrumental variables in panel data applications,
as the following example illustrates.

EXAMPLE 16.8 Effect of Prison Population on Violent Crime Rates


In order to estimate the causal effect of prison population increases on crime rates at the state level, 
Levitt (1996) used instances of prison overcrowding litigation as instruments for the growth in prison 
population. The equation Levitt estimated is in first differences; we can write an underlying fixed 
effects model as 


$$\log(crime_{it}) = \theta_t + \alpha_1 \log(prison_{it}) + z_{it1}\beta_1 + a_{i1} + u_{it1}, \tag{16.40}$$


where $\theta_t$ denotes different time intercepts, and crime and prison are measured per 100,000 people.
(The prison population variable is measured on the last day of the previous year.) The vector $z_{it1}$
contains log of police per capita, log of income per capita, the unemployment rate, the proportions of
the population that are black and that live in metropolitan areas, and age distribution proportions.

Differencing (16.40) gives the equation estimated by Levitt: 


$$\Delta\log(crime_{it}) = \xi_t + \alpha_1 \Delta\log(prison_{it}) + \Delta z_{it1}\beta_1 + \Delta u_{it1}. \tag{16.41}$$


Simultaneity between crime rates and prison population, or more precisely in the growth rates, makes
OLS estimation of (16.41) generally inconsistent. Using the violent crime rate and a subset of the
data from Levitt (in PRISON, for the years 1980 through 1993, for $51 \cdot 14 = 714$ total observations),
we obtain the pooled OLS estimate of $\alpha_1$, which is $-.181$ (se = .048). We also estimate (16.41) by
pooled 2SLS, where the instruments for $\Delta\log(prison)$ are two binary variables, one each for whether a
final decision was reached on overcrowding litigation in the current year or in the previous two years.
The pooled 2SLS estimate of $\alpha_1$ is $-1.032$ (se = .370). Therefore, the 2SLS estimated effect is much
larger in magnitude; not surprisingly, it is much less precise, too. Levitt found similar results when
using a longer time period (but with early observations missing for some states) and more instruments.


Testing for AR(1) serial correlation in $r_{it1} = \Delta u_{it1}$ is easy. After the pooled 2SLS estimation,
obtain the residuals, $\hat{r}_{it1}$. Then, include one lag of these residuals in the original equation, and
estimate the equation by 2SLS, where $\hat{r}_{i,t-1,1}$ acts as its own instrument. The first year is lost because of the
lagging. Then, the usual 2SLS t statistic on the lagged residual is a valid test for serial correlation. In
Example 16.8, the coefficient on $\hat{r}_{i,t-1,1}$ is only about .076 with t = 1.67. With such a small coefficient
and modest t statistic, we can safely assume serial independence.

An alternative approach to estimating SEMs with panel data is to use the fixed effects transfor- 
mation and then to apply an IV technique such as pooled 2SLS. A simple procedure is to estimate the 
time-demeaned equation by pooled 2SLS, which would look like 


$$\ddot{y}_{it1} = \alpha_1 \ddot{y}_{it2} + \ddot{z}_{it1}\beta_1 + \ddot{u}_{it1}, \quad t = 1, 2, \ldots, T, \tag{16.42}$$


where $\ddot{z}_{it1}$ and $\ddot{z}_{it2}$ are IVs. This is equivalent to using 2SLS in the dummy variable formulation, where
the unit-specific dummy variables act as their own instruments. Ayres and Levitt (1998) applied 2SLS
to a time-demeaned equation to estimate the effect of LoJack electronic theft prevention devices on car
theft rates in cities. If (16.42) is estimated directly, then the df needs to be corrected to $N(T - 1) - k_1$,
where $k_1$ is the total number of elements in $\alpha_1$ and $\beta_1$. Including unit-specific dummy variables and
applying pooled 2SLS to the original data produces the correct df. A detailed treatment of 2SLS with
panel data is given in Wooldridge (2010, Chapter 11).
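
The degrees-of-freedom correction can be made concrete with a small sketch. The helper below is
hypothetical (it is not part of any particular econometrics package), and it assumes that a routine run
directly on the time-demeaned data, without an intercept, uses $NT - k_1$ degrees of freedom. Under that
assumption the reported standard errors need to be multiplied by the square root of the ratio of the
naive df to the correct df.

```python
import math

def fe_2sls_df_correction(n_units, n_periods, k1):
    """Return (naive_df, correct_df, se_scale) for pooled 2SLS on time-demeaned data.

    naive_df   : df used if the demeaned data are treated as ordinary data (assumed N*T - k1)
    correct_df : N(T - 1) - k1, which accounts for the N unit-specific means that were removed
    se_scale   : factor by which the naive standard errors should be multiplied
    """
    naive_df = n_units * n_periods - k1
    correct_df = n_units * (n_periods - 1) - k1
    if correct_df <= 0:
        raise ValueError("not enough time periods to estimate the model")
    return naive_df, correct_df, math.sqrt(naive_df / correct_df)

# Hypothetical dimensions: 50 units, 5 time periods, 4 slope parameters.
naive_df, correct_df, se_scale = fe_2sls_df_correction(50, 5, 4)
```

Because the correct df is always smaller than the naive one, the scale factor exceeds one, so uncorrected
standard errors are too small. The adjustment matters most when T is small.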


Summary 


Simultaneous equations models are appropriate when firmly grounded in counterfactual reasoning. In par- 
ticular, each equation in the system should have a ceteris paribus interpretation. Good examples are when 
separate equations describe different sides of a market or the behavioral relationships of different economic
agents. Supply and demand examples are leading cases, but there are many other applications of SEMs in
economics and the social sciences. 

An important feature of SEMs is that, by fully specifying the system, it is clear which variables are 
assumed to be exogenous and which ones appear in each equation. Given a full system, we are able to 
determine which equations can be identified (that is, can be estimated). In the important case of a two- 
equation system, identification of (say) the first equation is easy to state: at least one exogenous variable 
must be excluded from the first equation that appears with a nonzero coefficient in the second equation. 
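
The exclusion requirement for a two-equation system can be sketched in a few lines of Python. The
function names and the variables z1, z2, and z3 are hypothetical. The check covers only the counting
part of the statement (which exogenous variables are excluded); whether an excluded variable really has
a nonzero coefficient in the other equation cannot be read off a list of names.

```python
def excluded_exogenous(own_exog, other_exog):
    """Exogenous variables that appear in the other equation but not in this one."""
    return sorted(set(other_exog) - set(own_exog))

def satisfies_order_condition(own_exog, other_exog):
    """True if at least one exogenous variable is excluded from this equation."""
    return len(excluded_exogenous(own_exog, other_exog)) >= 1

# Hypothetical system: z1 and z2 appear in equation 1; z1 and z3 appear in equation 2.
eq1_exog = ["z1", "z2"]
eq2_exog = ["z1", "z3"]

eq1_ok = satisfies_order_condition(eq1_exog, eq2_exog)   # z3 is excluded from equation 1
eq2_ok = satisfies_order_condition(eq2_exog, eq1_exog)   # z2 is excluded from equation 2
```

If the second equation contained only z1, the first equation would fail the check, because nothing
would be available to serve as an instrument for its endogenous explanatory variable.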

As we know from previous chapters, OLS estimation of an equation that contains an endogenous 
explanatory variable generally produces biased and inconsistent estimators. Instead, 2SLS can be used to 
estimate any identified equation in a system. More advanced system methods are available, but they are 
beyond the scope of our treatment. 

The distinction between omitted variables and simultaneity in applications is not always sharp. Both prob- 
lems, not to mention measurement error, can appear in the same equation. A good example is the labor supply 
of married women. Years of education (educ) appears in both the labor supply and the wage offer functions [see 
equations (16.19) and (16.20)]. If omitted ability is in the error term of the labor supply function, then wage and 
education are both endogenous. The important thing is that an equation estimated by 2SLS can stand on its own. 

SEMs can be applied to time series data as well. As with OLS estimation, we must be aware of trend- 
ing, integrated processes in applying 2SLS. Problems such as serial correlation can be handled as in Section 
15-7. We also gave an example of how to estimate an SEM using panel data, where the equation is first dif- 
ferenced to remove the unobserved effect. Then, we can estimate the differenced equation by pooled 2SLS, 
just as in Chapter 15. Alternatively, in some cases, we can use time-demeaning of all variables, including 
the IVs, and then apply pooled 2SLS; this is identical to putting in dummies for each cross-sectional obser- 
vation and using 2SLS, where the dummies act as their own instruments. SEM applications with panel data 
are very powerful, as they allow us to control for unobserved heterogeneity while dealing with simultane- 
ity. They are becoming more and more common and are not especially difficult to estimate. 


Key Terms 


Endogenous Variables          Overidentified Equation     Simultaneity Bias
Exclusion Restrictions        Predetermined Variable      Simultaneous Equations Model (SEM)
Exogenous Variables           Rank Condition              Structural Equation
Identified Equation           Reduced Form Equation       Structural Errors
Just Identified Equation      Reduced Form Error          Structural Parameters
Lagged Endogenous Variable    Reduced Form Parameters     Unidentified Equation
Order Condition               Simultaneity


Problems


1 Write a two-equation system in “supply and demand form,” that is, with the same variable $y_1$ (typically,
“quantity”) appearing on the left-hand side:


$$y_1 = \alpha_1 y_2 + \beta_1 z_1 + u_1$$
$$y_1 = \alpha_2 y_2 + \beta_2 z_2 + u_2.$$


(i) If $\alpha_1 = 0$ or $\alpha_2 = 0$, explain why a reduced form exists for $y_1$. (Remember, a reduced form
expresses $y_1$ as a linear function of the exogenous variables and the structural errors.) If $\alpha_1 \neq 0$
and $\alpha_2 = 0$, find the reduced form for $y_2$.

(ii) If $\alpha_1 \neq 0$, $\alpha_2 \neq 0$, and $\alpha_1 \neq \alpha_2$, find the reduced form for $y_1$. Does $y_2$ have a reduced form in
this case?

(iii) Is the condition $\alpha_1 \neq \alpha_2$ likely to be met in supply and demand examples? Explain.


2 Let corn denote per capita consumption of corn in bushels at the county level, let price be the price per
bushel of corn, let income denote per capita county income, and let rainfall be inches of rainfall during
the last corn-growing season. The following simultaneous equations model imposes the equilibrium
condition that supply equals demand:


$$corn = \alpha_1 price + \beta_1 income + u_1$$
$$corn = \alpha_2 price + \beta_2 rainfall + \gamma_2 rainfall^2 + u_2.$$


Which is the supply equation, and which is the demand equation? Explain.


3 In Problem 3 of Chapter 3, we estimated an equation to test for a tradeoff between minutes per week
spent sleeping (sleep) and minutes per week spent working (totwrk) for a random sample of individuals.
We also included education and age in the equation. Because sleep and totwrk are jointly chosen
by each individual, is the estimated tradeoff between sleeping and working subject to a “simultaneity
bias” criticism? Explain.


4 Suppose that annual earnings and alcohol consumption are determined by the SEM


$$\log(earnings) = \beta_0 + \beta_1 alcohol + \beta_2 educ + u_1$$
$$alcohol = \gamma_0 + \gamma_1 \log(earnings) + \gamma_2 educ + \gamma_3 \log(price) + u_2,$$


where price is a local price index for alcohol, which includes state and local taxes. Assume that educ
and price are exogenous. If $\beta_1$, $\beta_2$, $\gamma_1$, $\gamma_2$, and $\gamma_3$ are all different from zero, which equation is
identified? How would you estimate that equation?


5 A simple model to determine the effectiveness of condom usage on reducing sexually transmitted
diseases among sexually active high school students is


$$infrate = \beta_0 + \beta_1 conuse + \beta_2 percmale + \beta_3 avginc + \beta_4 city + u_1,$$


where

infrate = the percentage of sexually active students who have contracted venereal disease.
conuse = the percentage of boys who claim to use condoms regularly.
avginc = average family income.
city = a dummy variable indicating whether a school is in a city.

The model is at the school level.

(i) Interpreting the preceding equation in a causal, ceteris paribus fashion, what should be the
sign of $\beta_1$?

(ii) Why might infrate and conuse be jointly determined?

(iii) If condom usage increases with the rate of venereal disease, so that $\gamma_1 > 0$ in the equation


$$conuse = \gamma_0 + \gamma_1 infrate + \text{other factors},$$


what is the likely bias in estimating $\beta_1$ by OLS?

(iv) Let condis be a binary variable equal to unity if a school has a program to distribute condoms.
Explain how this can be used to estimate $\beta_1$ (and the other betas) by IV. What do we have to
assume about condis in each equation?


6 Consider a linear probability model for whether employers offer a pension plan based on the percentage
of workers belonging to a union, as well as other factors:


$$pension = \beta_0 + \beta_1 percunion + \beta_2 avgage + \beta_3 avgeduc + \beta_4 percmale + \beta_5 percmarr + u_1.$$


(i) Why might percunion be jointly determined with pension?

(ii) Suppose that you can survey workers at firms and collect information on workers’ families. Can
you think of information that can be used to construct an IV for percunion?

(iii) How would you test whether your variable is at least a reasonable IV candidate for percunion?


7 For a large university, you are asked to estimate the demand for tickets to women’s basketball games. 
You can collect time series data over 10 seasons, for a total of about 150 observations. One possible 
model is 

$$lATTEND_t = \beta_0 + \beta_1 lPRICE_t + \beta_2 WINPERC_t + \beta_3 RIVAL_t + \beta_4 WEEKEND_t + \beta_5 t + u_t,$$


where

$PRICE_t$ = the price of admission, probably measured in real terms (say, by
deflating by a regional consumer price index).
$WINPERC_t$ = the team’s current winning percentage.
$RIVAL_t$ = a dummy variable indicating a game against a rival.
$WEEKEND_t$ = a dummy variable indicating whether the game is on a weekend.

The l denotes natural logarithm, so that the demand function has a constant price elasticity.

(i) Why is it a good idea to have a time trend in the equation? 

(ii) The supply of tickets is fixed by the stadium capacity; assume this has not changed over the 
10 years. This means that quantity supplied does not vary with price. Does this mean that price 
is necessarily exogenous in the demand equation? (Hint: The answer is no.) 

(iii) Suppose that the nominal price of admission changes slowly, say, at the beginning of each
season. The athletic office chooses price based partly on last season’s average attendance, as
well as last season’s team success. Under what assumptions is last season’s winning percentage
($SEASPERC_{t-1}$) a valid instrumental variable for $lPRICE_t$?

(iv) Does it seem reasonable to include the (log of the) real price of men’s basketball games in the 
equation? Explain. What sign does economic theory predict for its coefficient? Can you think 
of another variable related to men’s basketball that might belong in the women’s attendance 
equation? 

(v) If you are worried that some of the series, particularly lATTEND and lPRICE, have unit roots,
how might you change the estimated equation?

(vi) If some games are sold out, what problems does this cause for estimating the demand function? 
(Hint: If a game is sold out, do you necessarily observe the true demand?) 


8 How big is the effect of per-student school expenditures on local housing values? Let HPRICE be the 
median housing price in a school district and let EXPEND be per-student expenditures. Using panel 
data for the years 1992, 1994, and 1996, we postulate the model 


$$lHPRICE_{it} = \theta_t + \beta_1 lEXPEND_{it} + \beta_2 lPOLICE_{it} + \beta_3 lMEDINC_{it} + \beta_4 PROPTAX_{it} + a_{i1} + u_{it1},$$


where $POLICE_{it}$ is per capita police expenditures, $MEDINC_{it}$ is median income, and $PROPTAX_{it}$ is
the property tax rate; l denotes natural logarithm. Expenditures and housing price are simultaneously
determined because the value of homes directly affects the revenues available for funding schools.

Suppose that, in 1994, the way schools were funded was drastically changed: rather than
being raised by local property taxes, school funding was largely determined at the state level. Let
$lSTATEALL_{it}$ denote the log of the state allocation for district i in year t, which is exogenous in the
preceding equation, once we control for expenditures and a district fixed effect. How would you
estimate the $\beta_j$?


Computer Exercises


C1 Use SMOKE for this exercise.

(i) A model to estimate the effects of smoking on annual income (perhaps through lost work days
due to illness, or productivity effects) is


$$\log(income) = \beta_0 + \beta_1 cigs + \beta_2 educ + \beta_3 age + \beta_4 age^2 + u_1,$$


where cigs is number of cigarettes smoked per day, on average. How do you interpret $\beta_1$?

(ii) To reflect the fact that cigarette consumption might be jointly determined with income, a
demand for cigarettes equation is


$$cigs = \gamma_0 + \gamma_1 \log(income) + \gamma_2 educ + \gamma_3 age + \gamma_4 age^2 + \gamma_5 \log(cigpric) + \gamma_6 restaurn + u_2,$$


where cigpric is the price of a pack of cigarettes (in cents) and restaurn is a binary variable
equal to unity if the person lives in a state with restaurant smoking restrictions. Assuming these
are exogenous to the individual, what signs would you expect for $\gamma_5$ and $\gamma_6$?

(iii) Under what assumption is the income equation from part (i) identified?

(iv) Estimate the income equation by OLS and discuss the estimate of $\beta_1$.

(v) Estimate the reduced form for cigs. (Recall that this entails regressing cigs on all exogenous
variables.) Are log(cigpric) and restaurn significant in the reduced form?

(vi) Now, estimate the income equation by 2SLS. Discuss how the estimate of $\beta_1$ compares with the
OLS estimate.

(vii) Do you think that cigarette prices and restaurant smoking restrictions are exogenous in the
income equation?


C2 Use MROZ for this exercise.

(i) Reestimate the labor supply function in Example 16.5, using log(hours) as the dependent variable.
Compare the estimated elasticity (which is now constant) to the estimate obtained from
equation (16.24) at the average hours worked.

(ii) In the labor supply equation from part (i), allow educ to be endogenous because of omitted
ability. Use motheduc and fatheduc as IVs for educ. Remember, you now have two endogenous
variables in the equation.

(iii) Test the overidentifying restrictions in the 2SLS estimation from part (ii). Do the IVs pass the test?


C3 Use the data in OPENNESS for this exercise.

(i) Because log(pcinc) is insignificant in both (16.22) and the reduced form for open, drop it from
the analysis. Estimate (16.22) by OLS and IV without log(pcinc). Do any important conclusions
change?

(ii) Still leaving log(pcinc) out of the analysis, is land or log(land) a better instrument for open?
(Hint: Regress open on each of these separately and jointly.)

(iii) Now, return to (16.22). Add the dummy variable oil to the equation and treat it as exogenous.
Estimate the equation by IV. Does being an oil producer have a ceteris paribus effect on
inflation?


C4 Use the data in CONSUMP for this exercise.

(i) In Example 16.7, use the method from Section 15-5 to test the single overidentifying restriction
in estimating (16.35). What do you conclude?

(ii) Campbell and Mankiw (1990) use second lags of all variables as IVs because of potential data
measurement problems and informational lags. Reestimate (16.35), using only $gc_{t-2}$, $gy_{t-2}$, and
$r3_{t-2}$ as IVs. How do the estimates compare with those in (16.36)?

(iii)
Regress $gy_t$ on the IVs from part (ii) and test whether $gy_t$ is sufficiently correlated with them.
Why is this important?


C5 Use the Economic Report of the President (2005 or later) to update the data in CONSUMP, at least
through 2003. Reestimate equation (16.35). Do any important conclusions change?


C6 Use the data in CEMENT for this exercise.

(i) A static (inverse) supply function for the monthly growth in cement price (gprc) as a function of
growth in quantity (gcem) is


$$gprc_t = \alpha_1 gcem_t + \beta_0 + \beta_1 gprcpet_t + \beta_2 feb_t + \cdots + \beta_{12} dec_t + u_t^s,$$


where gprcpet (growth in the price of petroleum) is assumed to be exogenous and feb, . . . , dec
are monthly dummy variables. What signs do you expect for $\alpha_1$ and $\beta_1$? Estimate the equation
by OLS. Does the supply function slope upward?

(ii) The variable gdefs is the monthly growth in real defense spending in the United States. What do
you need to assume about gdefs for it to be a good IV for gcem? Test whether gcem is partially
correlated with gdefs. (Do not worry about possible serial correlation in the reduced form.) Can
you use gdefs as an IV in estimating the supply function?

(iii) Shea (1993) argues that the growth in output of residential (gres) and nonresidential (gnon)
construction are valid instruments for gcem. The idea is that these are demand shifters that should
be roughly uncorrelated with the supply error $u_t^s$. Test whether gcem is partially correlated with
gres and gnon; again, do not worry about serial correlation in the reduced form.

(iv) Estimate the supply function, using gres and gnon as IVs for gcem. What do you conclude about
the static supply function for cement? [The dynamic supply function is, apparently, upward
sloping; see Shea (1993).]


C7 Refer to Example 13.9 and the data in CRIME4.

(i) Suppose that, after differencing to remove the unobserved effect, you think $\Delta\log(polpc)$ is
simultaneously determined with $\Delta\log(crmrte)$; in particular, increases in crime are associated
with increases in police officers. How does this help to explain the positive coefficient on
$\Delta\log(polpc)$ in equation (13.33)?

(ii) The variable taxpc is the taxes collected per person in the county. Does it seem reasonable to
exclude this from the crime equation?

(iii) Estimate the reduced form for $\Delta\log(polpc)$ using pooled OLS, including the potential IV,
$\Delta\log(taxpc)$. Does it look like $\Delta\log(taxpc)$ is a good IV candidate? Explain.

(iv) Suppose that, in several of the years, the state of North Carolina awarded grants to some counties
to increase the size of their county police force. How could you use this information to estimate
the effect of additional police officers on the crime rate?


C8 Use the data set in FISH, which comes from Graddy (1995), to do this exercise. The data set is also used
in Computer Exercise C9 in Chapter 12. Now, we will use it to estimate a demand function for fish.

(i) Assume that the demand equation can be written, in equilibrium for each time period, as


$$\log(totqty_t) = \alpha_1 \log(avgprc_t) + \beta_{10} + \beta_{11} mon_t + \beta_{12} tues_t + \beta_{13} wed_t + \beta_{14} thurs_t + u_{t1},$$


so that demand is allowed to differ across days of the week. Treating the price variable as
endogenous, what additional information do we need to estimate the demand-equation parameters
consistently?

(ii) The variables $wave2_t$ and $wave3_t$ are measures of ocean wave heights over the past several
days. What two assumptions do we need to make in order to use $wave2_t$ and $wave3_t$ as IVs for
$\log(avgprc_t)$ in estimating the demand equation?

(iii) Regress $\log(avgprc_t)$ on the day-of-the-week dummies and the two wave measures. Are $wave2_t$
and $wave3_t$ jointly significant? What is the p-value of the test?

(iv) Now, estimate the demand equation by 2SLS. What is the 95% confidence interval for the price
elasticity of demand? Is the estimated elasticity reasonable?


C9


C10 


C11

(v) Obtain the 2SLS residuals, $\hat{u}_{t1}$. Add a single lag, $\hat{u}_{t-1,1}$, in estimating the demand equation by
2SLS. Remember, use $\hat{u}_{t-1,1}$ as its own instrument. Is there evidence of AR(1) serial correlation
in the demand equation errors?

(vi) Given that the supply equation evidently depends on the wave variables, what two assumptions 
would we need to make in order to estimate the price elasticity of supply? 

(vii) In the reduced form equation for log(avgprc,), are the day-of-the-week dummies jointly signifi- 
cant? What do you conclude about being able to estimate the supply elasticity? 


For this exercise, use the data in AIRFARE, but only for the year 1997. 
(i) A simple demand function for airline seats on routes in the United States is 


log(passen) = Bio + a,log(fare) + B,,log(dist) + B,[log(dist) P + u,, 
where 


passen = average passengers per day, 
fare = average airfare, and 


dist = the route distance (in miles). 


If this is truly a demand function, what should be the sign of a? 

(ii) Estimate the equation from part (i) by OLS. What is the estimated price elasticity? 

(iii) Consider the variable concen, which is a measure of market concentration. (Specifically, it is the 
share of business accounted for by the largest carrier.) Explain in words what we must assume 
to treat concen as exogenous in the demand equation. 

(iv) Now assume concen is exogenous to the demand equation. Estimate the reduced form for 
log(fare) and confirm that concen has a positive (partial) effect on log(fare). 

(v) Estimate the demand function using IV. Now what is the estimated price elasticity of demand? 
How does it compare with the OLS estimate? 

(vi) Using the IV estimates, describe how demand for seats depends on route distance. 


Use the entire panel data set in AIRFARE for this exercise. The demand equation in a simultaneous 
equations unobserved effects model is 


log(passen;,) = 6, + a,log(fare;,) + ay + Uin, 


where we absorb the distance variables into aj. 

(i) Estimate the demand function using fixed effects, being sure to include year dummies to 
account for the different intercepts. What is the estimated elasticity? 

Gi) Use fixed effects to estimate the reduced form 


log(fare;,) = 0. + 1,concen;, + an + Vip. 


Perform the appropriate test to ensure that concen; can be used as an IV for log(fare;,). 
(iii) Now estimate the demand function using the fixed effects transformation along with IV, as in 
equation (16.42). What is the estimated elasticity? Is it statistically significant? 


C12 A common method for estimating Engel curves is to model expenditure shares as a function of total
expenditure, and possibly demographic variables. A common specification has the form

$sgood = \beta_0 + \beta_1 ltotexpend + demographics + u$,

where sgood is the fraction of spending on a particular good out of total expenditure and ltotexpend is
the log of total expenditure. The sign and magnitude of $\beta_1$ are of interest across various expenditure
categories.

To account for the potential endogeneity of ltotexpend (which can be viewed as an omitted
variables or simultaneous equations problem, or both), the log of family income is often used as an
instrumental variable. Let lincome denote the log of family income. For the remainder of this question,
use the data in EXPENDSHARES, which comes from Blundell, Duncan, and Pendakur (1998).

(i) Use sfood, the share of spending on food, as the dependent variable. What is the range of values
of sfood? Are you surprised there are no zeros?

(ii) Estimate the equation

$sfood = \beta_0 + \beta_1 ltotexpend + \beta_2 age + \beta_3 kids + u$ [16.43]

by OLS and report the coefficient on ltotexpend, $\hat{\beta}_{OLS,1}$, along with its heteroskedasticity-robust
standard error. Interpret the result.

(iii) Using lincome as an IV for ltotexpend, estimate the reduced form equation for ltotexpend; be
sure to include age and kids. Assuming lincome is exogenous in (16.43), is lincome a valid IV
for ltotexpend?

(iv) Now estimate (16.43) by instrumental variables. How does $\hat{\beta}_{IV,1}$ compare with $\hat{\beta}_{OLS,1}$? What
about the robust 95% confidence intervals?

(v) Use the test in Section 15-5 to test the null hypothesis that ltotexpend is exogenous in (16.43).
Be sure to report and interpret the p-value. Are there any overidentifying restrictions to test?

(vi) Substitute salcohol for sfood in (16.43) and estimate the equation by OLS and 2SLS. Now what
do you find for the coefficients on ltotexpend?


Use the data in PRISON.DTA to answer the following questions. Refer to Example 16.8. In the data
set, variables beginning with “g” are growth rates from one year to the next, obtained as the changes
in the natural log. For example, $gcriv_{it} = \log(criv_{it}) - \log(criv_{i,t-1})$. Variables beginning with “c” are
changes in levels from one year to the next, for example, $cunem_{it} = unem_{it} - unem_{i,t-1}$.

(i) Estimate the equation

$gcriv_{it} = \xi_t + \alpha_1 gpris_{it} + \beta_1 gincpc_{it} + \beta_2 gpolpc_{it} + \beta_3 cag0\_14_{it} + \beta_4 cag15\_17_{it}$
$\qquad + \beta_5 cag18\_24_{it} + \beta_6 cag25\_34_{it} + \beta_7 cunem_{it} + \beta_8 cblack_{it} + \beta_9 cmetro_{it} + \Delta u_{it}$

by OLS and verify that you obtain $\hat{\alpha}_1 = -0.181$ (se = .048). The parameters $\xi_t$ are to remind
you to include year dummies for 1981 through 1993.

(ii) Estimate the reduced form equation for $gpris_{it}$, where $final1_{it}$ and $final2_{it}$ are the instruments:

$gpris_{it} = \eta_t + \gamma_1 final1_{it} + \gamma_2 final2_{it} + \pi_1 gincpc_{it} + \pi_2 gpolpc_{it} + \pi_3 cag0\_14_{it}$
$\qquad + \pi_4 cag15\_17_{it} + \pi_5 cag18\_24_{it} + \pi_6 cag25\_34_{it} + \pi_7 cunem_{it} + \pi_8 cblack_{it}$
$\qquad + \pi_9 cmetro_{it} + \Delta u_{it}$.

Verify that $\gamma_1$ and $\gamma_2$ are both negative. Are they each statistically significant? What is the F
statistic for $H_0: \gamma_1 = \gamma_2 = 0$? (Remember again to put in a full set of year dummies.)

(iii) Obtain the 2SLS estimates of the equation in part (i), using $final1_{it}$ and $final2_{it}$ as instruments for
$gpris_{it}$. Verify that you obtain $\hat{\alpha}_1 = -1.032$ (se = .370).

(iv) If you have access to econometrics software that computes standard errors robust to
heteroskedasticity and serial correlation, obtain them for the 2SLS estimate in part (iii). What happens to
the standard error of $\hat{\alpha}_1$?

(v) Reestimate the reduced form in part (ii) using the differences, $\Delta final1_{it}$ and $\Delta final2_{it}$, as the
instruments. (You will lose 1980 in differencing the instruments.) Do $\Delta final1_{it}$ and $\Delta final2_{it}$
seem like sufficiently strong instruments for $gpris_{it}$? In particular, do you prefer using the
differences or levels as the IVs? Estimate the reduced form in part (ii) dropping 1980 to be
sure that you reach your conclusion on instrument strength using the same set of data.


CHAPTER 17 Limited Dependent Variable Models and Sample Selection Corrections


In Chapter 7, we studied the linear probability model, which is simply an application of the multiple
regression model to a binary dependent variable. A binary dependent variable is an example of a
limited dependent variable (LDV). An LDV is broadly defined as a dependent variable whose
range of values is substantively restricted. A binary variable takes on only two values, zero and one. 
In Section 7-7, we discussed the interpretation of multiple regression estimates for generally discrete 
response variables, focusing on the case where y takes on a small number of integer values—for 
example, the number of times a young man is arrested during a year or the number of children born 
to a woman. Elsewhere, we have encountered several other limited dependent variables, including 
the percentage of people participating in a pension plan (which must be between zero and 100) and 
college grade point average (which is between zero and 4.0 at most colleges). 

Most economic variables we would like to explain are limited in some way, often because they 
must be positive. For example, hourly wage, housing price, and nominal interest rates must be greater 
than zero. But not all such variables need special treatment. If a strictly positive variable takes on 
many different values, a special econometric model is rarely necessary. When y is discrete and takes 
on a small number of values, it makes no sense to treat it as an approximately continuous variable. 
Discreteness of y does not in itself mean that linear models are inappropriate. However, as we saw in 
Chapter 7 for binary response, the linear probability model has certain drawbacks. In Section 17-1, 
we discuss logit and probit models, which overcome the shortcomings of the LPM; the disadvantage
is that they are more difficult to interpret.

Other kinds of limited dependent variables arise in econometric analysis, especially when the
behavior of individuals, families, or firms is being modeled. Optimizing behavior often leads to a 
corner solution response for some nontrivial fraction of the population. That is, it is optimal to 
choose a zero quantity or dollar value, for example. During any given year, a significant number of 
families will make zero charitable contributions. Therefore, annual family charitable contributions 
has a population distribution that is spread out over a large range of positive values, but with a pileup 
at the value zero. Although a linear model could be appropriate for capturing the expected value of 
charitable contributions, a linear model will likely lead to negative predictions for some families. 
Taking the natural log is not possible because many observations are zero. The Tobit model, which we 
cover in Section 17-2, is explicitly designed to model corner solution dependent variables. 

Another important kind of LDV is a count variable, which takes on nonnegative integer values. 
Section 17-3 illustrates how Poisson regression models are well suited for modeling count variables. 

In some cases, we encounter limited dependent variables due to data censoring, a topic we 
introduce in Section 17-4. The general problem of sample selection, where we observe a nonrandom 
sample from the underlying population, is treated in Section 17-5. 

Limited dependent variable models can be used for time series and panel data, but they are
most often applied to cross-sectional data. Sample selection problems are usually confined to
cross-sectional or panel data. We focus on cross-sectional applications in this chapter. Wooldridge (2010)
analyzes these problems in the context of panel data models and provides many more details for
cross-sectional and panel data applications.


17-1 Logit and Probit Models for Binary Response 


The linear probability model is simple to estimate and use, but it has some drawbacks that we dis- 
cussed in Section 7-5. The two most important disadvantages are that the fitted probabilities can be 
less than zero or greater than one and the partial effect of any explanatory variable (appearing in level 
form) is constant. These limitations of the LPM can be overcome by using more sophisticated binary 
response models. 

In a binary response model, interest lies primarily in the response probability 


$P(y = 1 \mid \mathbf{x}) = P(y = 1 \mid x_1, x_2, \ldots, x_k)$, [17.1]


where we use x to denote the full set of explanatory variables. For example, when y is an employment 
indicator, x might contain various individual characteristics such as education, age, marital status, and 
other factors that affect employment status, including a binary indicator variable for participation in a 
recent job training program. 


17-1a Specifying Logit and Probit Models 


In the LPM, we assume that the response probability is linear in a set of parameters, $\beta_j$; see
equation (7.27). To avoid the LPM limitations, consider a class of binary response models of the form

$P(y = 1 \mid \mathbf{x}) = G(\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k) = G(\beta_0 + \mathbf{x}\boldsymbol{\beta})$, [17.2]

where G is a function taking on values strictly between zero and one: $0 < G(z) < 1$, for all real
numbers z. This ensures that the estimated response probabilities are strictly between zero and one.
As in earlier chapters, we write $\mathbf{x}\boldsymbol{\beta} = \beta_1 x_1 + \cdots + \beta_k x_k$.

Various nonlinear functions have been suggested for the function G to make sure that the
probabilities are between zero and one. The two we will cover here are used in the vast majority of 
applications (along with the LPM). In the logit model, G is the logistic function: 


$G(z) = \exp(z)/[1 + \exp(z)] = \Lambda(z)$, [17.3]


which is between zero and one for all real numbers z. This is the cumulative distribution function 
(cdf) for a standard logistic random variable. In the probit model, G is the standard normal cdf, 
which is expressed as an integral: 


$G(z) = \Phi(z) \equiv \int_{-\infty}^{z} \phi(v)\,dv$, [17.4]

where $\phi(z)$ is the standard normal density

$\phi(z) = (2\pi)^{-1/2}\exp(-z^2/2)$. [17.5]


This choice of G again ensures that (17.2) is strictly between zero and one for all values of the
parameters and the $x_j$.

The G functions in (17.3) and (17.4) are both increasing functions. Each increases most quickly at
$z = 0$, $G(z) \to 0$ as $z \to -\infty$, and $G(z) \to 1$ as $z \to \infty$. The logistic function is plotted in Figure 17.1.
The standard normal cdf has a shape very similar to that of the logistic cdf.
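
To see how close the two choices of G are, the following minimal sketch evaluates the logistic function in (17.3) and the standard normal cdf in (17.4) at a few points. The function names logistic_cdf and normal_cdf are illustrative choices, not part of the text, and the normal cdf is computed from the error function in Python's standard library.

```python
import math

def logistic_cdf(z: float) -> float:
    """Logistic function from (17.3); the two branches avoid overflow in exp()."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def normal_cdf(z: float) -> float:
    """Standard normal cdf from (17.4), computed with the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Tabulate both functions on a small grid of z values.
for z in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(f"z = {z:5.1f}   logit G(z) = {logistic_cdf(z):.4f}   probit G(z) = {normal_cdf(z):.4f}")
```

Both functions equal one-half at z = 0 and flatten out in the tails; the logistic cdf approaches zero and one somewhat more slowly than the normal cdf.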

Logit and probit models can be derived from an underlying latent variable model. Let $y^*$ be an
unobserved, or latent, variable, and suppose that

$y^* = \beta_0 + \mathbf{x}\boldsymbol{\beta} + e, \quad y = 1[y^* > 0]$, [17.6]

where we introduce the notation $1[\cdot]$ to define a binary outcome. The function $1[\cdot]$ is called the
indicator function, which takes on the value one if the event in brackets is true, and zero otherwise.
Therefore, y is one if $y^* > 0$, and y is zero if $y^* \le 0$. We assume that e is independent of x and that
e has either the standard logistic distribution or the standard normal distribution. In either case, e is
symmetrically distributed about zero, which means that $1 - G(-z) = G(z)$ for all real numbers z.
Economists tend to favor the normality assumption for e, which is why the probit model is more popular
than logit in econometrics. In addition, several specification problems, which we touch on later,
are most easily analyzed using probit because of properties of the normal distribution.

FIGURE 17.1 Graph of the logistic function $G(z) = \exp(z)/[1 + \exp(z)]$.

From (17.6) and the assumptions given, we can derive the response probability for y: 


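$P(y = 1 \mid \mathbf{x}) = P(y^* > 0 \mid \mathbf{x}) = P[e > -(\beta_0 + \mathbf{x}\boldsymbol{\beta}) \mid \mathbf{x}]$
$\qquad = 1 - G[-(\beta_0 + \mathbf{x}\boldsymbol{\beta})] = G(\beta_0 + \mathbf{x}\boldsymbol{\beta})$,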


which is exactly the same as (17.2). 
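
The derivation can be checked by simulation. The sketch below assumes a single explanatory variable, a standard normal error (the probit case), and hypothetical parameter values; it draws the latent variable many times and compares the share of draws with $y^* > 0$ to $\Phi(\beta_0 + \beta_1 x)$. The function name simulate_share and all numbers are illustrative only.

```python
import math
import random

def normal_cdf(z: float) -> float:
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def simulate_share(beta0: float, beta1: float, x: float,
                   n_draws: int = 100_000, seed: int = 12345) -> float:
    """Share of draws with y = 1[y* > 0], where y* = beta0 + beta1*x + e and e ~ N(0, 1)."""
    rng = random.Random(seed)
    index = beta0 + beta1 * x
    hits = sum(1 for _ in range(n_draws) if index + rng.gauss(0.0, 1.0) > 0.0)
    return hits / n_draws

beta0, beta1 = -0.5, 0.8  # hypothetical parameter values, chosen only for illustration
for x in (0.0, 1.0, 2.0):
    simulated = simulate_share(beta0, beta1, x)
    implied = normal_cdf(beta0 + beta1 * x)
    print(f"x = {x:.1f}   simulated P(y = 1|x) = {simulated:.4f}   G(b0 + b1*x) = {implied:.4f}")
```

Up to simulation error, the two columns agree, which is what the equality between the latent variable formulation and (17.2) implies.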

In most applications of binary response models, the primary goal is to explain the effects of the
$x_j$ on the response probability $P(y = 1 \mid \mathbf{x})$. The latent variable formulation tends to give the
impression that we are primarily interested in the effects of each $x_j$ on $y^*$. As we will see, for logit and probit,
the direction of the effect of $x_j$ on $E(y^* \mid \mathbf{x}) = \beta_0 + \mathbf{x}\boldsymbol{\beta}$ and on $E(y \mid \mathbf{x}) = P(y = 1 \mid \mathbf{x}) = G(\beta_0 + \mathbf{x}\boldsymbol{\beta})$
is always the same. But the latent variable $y^*$ rarely has a well-defined unit of measurement. (For
example, $y^*$ might be the difference in utility levels from two different actions.) Thus, the magnitudes
of each $\beta_j$ are not, by themselves, especially useful (in contrast to the linear probability model). For
most purposes, we want to estimate the effect of $x_j$ on the probability of success $P(y = 1 \mid \mathbf{x})$, but this
is complicated by the nonlinear nature of $G(\cdot)$.

To find the partial effect of roughly continuous variables on the response probability, we must
rely on calculus. If $x_j$ is a roughly continuous variable, its partial effect on $p(\mathbf{x}) = P(y = 1 \mid \mathbf{x})$ is
obtained from the partial derivative:

$\frac{\partial p(\mathbf{x})}{\partial x_j} = g(\beta_0 + \mathbf{x}\boldsymbol{\beta})\beta_j$, where $g(z) \equiv \frac{dG}{dz}(z)$. [17.7]

Because G is the cdf of a continuous random variable, g is a probability density function (pdf). In the
logit and probit cases, $G(\cdot)$ is a strictly increasing cdf, and so $g(z) > 0$ for all z. Therefore, the partial
effect of $x_j$ on $p(\mathbf{x})$ depends on x through the positive quantity $g(\beta_0 + \mathbf{x}\boldsymbol{\beta})$, which means that the
partial effect always has the same sign as $\beta_j$.

Equation (17.7) shows that the relative effects of any two continuous explanatory variables do
not depend on x: the ratio of the partial effects for $x_j$ and $x_h$ is $\beta_j/\beta_h$. In the typical case that g is a
symmetric density about zero, with a unique mode at zero, the largest effect occurs when $\beta_0 + \mathbf{x}\boldsymbol{\beta} = 0$.
For example, in the probit case with $g(z) = \phi(z)$, $g(0) = \phi(0) = 1/\sqrt{2\pi} \approx .40$. In the logit case,
$g(z) = \exp(z)/[1 + \exp(z)]^2$, and so $g(0) = .25$.
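
These statements are easy to verify numerically. The sketch below, with hypothetical coefficient values and illustrative function names, evaluates the two densities at zero and confirms that the ratio of two partial effects from (17.7) equals the ratio of the coefficients at any value of x.

```python
import math

def probit_density(z: float) -> float:
    """Standard normal density, the g(.) for probit."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def logit_density(z: float) -> float:
    """Logistic density exp(z)/[1 + exp(z)]^2; it is symmetric, so -|z| avoids overflow."""
    ez = math.exp(-abs(z))
    return ez / (1.0 + ez) ** 2

def partial_effect(g, beta0, betas, x, j):
    """Partial effect g(beta0 + x*beta) * beta_j from (17.7)."""
    index = beta0 + sum(b * xi for b, xi in zip(betas, x))
    return g(index) * betas[j]

print(f"probit g(0) = {probit_density(0.0):.4f}, logit g(0) = {logit_density(0.0):.4f}")

beta0, betas = 0.2, [0.6, -0.3]  # hypothetical coefficients, for illustration only
for x in ([0.0, 0.0], [1.5, 2.0], [-1.0, 4.0]):
    pe0 = partial_effect(probit_density, beta0, betas, x, 0)
    pe1 = partial_effect(probit_density, beta0, betas, x, 1)
    print(f"x = {x}: effects {pe0:.4f} and {pe1:.4f}, ratio {pe0 / pe1:.2f}")
```

The individual partial effects change with x, but their ratio stays at $\beta_1/\beta_2$ because the common scale factor cancels.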

If, say, $x_1$ is a binary explanatory variable, then the partial effect from changing $x_1$ from zero to
one, holding all other variables fixed, is simply

$G(\beta_0 + \beta_1 + \beta_2 x_2 + \cdots + \beta_k x_k) - G(\beta_0 + \beta_2 x_2 + \cdots + \beta_k x_k)$. [17.8]

Again, this depends on all the values of the other $x_j$. For example, if y is an employment indicator and
$x_1$ is a dummy variable indicating participation in a job training program, then (17.8) is the change in
the probability of employment due to the job training program; this depends on other characteristics
that affect employability, such as education and experience. Note that knowing the sign of $\beta_1$ is
sufficient for determining whether the program had a positive or negative effect. But to find the magnitude
of the effect, we have to estimate the quantity in (17.8).

We can also use the difference in (17.8) for other kinds of discrete variables (such as number of
children). If $x_k$ denotes this variable, then the effect on the probability of $x_k$ going from $c_k$ to $c_k + 1$ is
simply

$G[\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k(c_k + 1)] - G(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k c_k)$. [17.9]
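
The differences in (17.8) and (17.9) can be sketched in a few lines. The example assumes a probit model for employment with a training dummy, years of education, and years of experience; the coefficient values and the helper name discrete_effect are hypothetical and serve only to show that the same dummy variable has different effects at different values of the other regressors.

```python
import math

def normal_cdf(z: float) -> float:
    """Standard normal cdf, the G(.) for probit."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def discrete_effect(G, beta0, betas, x, k, new_value):
    """Change in G(beta0 + x*beta) when x[k] moves from its current value to new_value."""
    base = beta0 + sum(b * xi for b, xi in zip(betas, x))
    x_new = list(x)
    x_new[k] = new_value
    alt = beta0 + sum(b * xi for b, xi in zip(betas, x_new))
    return G(alt) - G(base)

# Hypothetical probit coefficients on (train, educ, exper); not estimates from any data set.
beta0, betas = -1.0, [0.4, 0.08, 0.02]

# Effect of training (0 -> 1) for two different workers, as in (17.8).
low = discrete_effect(normal_cdf, beta0, betas, [0, 10, 5], k=0, new_value=1)
high = discrete_effect(normal_cdf, beta0, betas, [0, 16, 20], k=0, new_value=1)
print(f"training effect, educ = 10, exper = 5:  {low:.4f}")
print(f"training effect, educ = 16, exper = 20: {high:.4f}")

# Effect of one more year of education (12 -> 13), as in (17.9).
one_more = discrete_effect(normal_cdf, beta0, betas, [0, 12, 5], k=1, new_value=13)
print(f"effect of educ going from 12 to 13:     {one_more:.4f}")
```

Both training effects are positive because the hypothetical coefficient on the dummy is positive, but their magnitudes differ, which is the point made in the text.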
It is straightforward to include standard functional forms among the explanatory variables. For
example, in the model

$P(y = 1 \mid \mathbf{z}) = G(\beta_0 + \beta_1 z_1 + \beta_2 z_1^2 + \beta_3\log(z_2) + \beta_4 z_3)$,

the partial effect of $z_1$ on $P(y = 1 \mid \mathbf{z})$ is $\partial P(y = 1 \mid \mathbf{z})/\partial z_1 = g(\beta_0 + \mathbf{x}\boldsymbol{\beta})(\beta_1 + 2\beta_2 z_1)$, and the
partial effect of $z_2$ on the response probability is $\partial P(y = 1 \mid \mathbf{z})/\partial z_2 = g(\beta_0 + \mathbf{x}\boldsymbol{\beta})(\beta_3/z_2)$, where
$\mathbf{x}\boldsymbol{\beta} = \beta_1 z_1 + \beta_2 z_1^2 + \beta_3\log(z_2) + \beta_4 z_3$. Therefore, $g(\beta_0 + \mathbf{x}\boldsymbol{\beta})(\beta_3/100)$ is the approximate change
in the response probability when $z_2$ increases by 1%.

Sometimes we want to compute the elasticity of the response probability with respect to an
explanatory variable, although we must be careful in interpreting percentage changes in probabilities.
For example, a change in a probability from .04 to .06 represents a 2-percentage-point increase in the
probability, but a 50% increase relative to the initial value. Using calculus, in the preceding model
the elasticity of $P(y = 1 \mid \mathbf{z})$ with respect to $z_2$ can be shown to be $\beta_3[g(\beta_0 + \mathbf{x}\boldsymbol{\beta})/G(\beta_0 + \mathbf{x}\boldsymbol{\beta})]$. The
elasticity with respect to $z_3$ is $(\beta_4 z_3)[g(\beta_0 + \mathbf{x}\boldsymbol{\beta})/G(\beta_0 + \mathbf{x}\boldsymbol{\beta})]$. In the first case, the elasticity is
always the same sign as $\beta_3$, but it generally depends on all parameters and all values of the
explanatory variables. If $z_3 > 0$, the second elasticity always has the same sign as the parameter $\beta_4$.

Models with interactions among the explanatory variables can be a bit tricky, but one should
compute the partial derivatives and then evaluate the resulting partial effects at interesting values.
When measuring the effects of discrete variables, no matter how complicated the model, we should
use (17.9). We discuss this further in Section 17-1d on interpreting the estimates.
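
As a check on the calculus, the sketch below takes the model with a quadratic in $z_1$ and the log of $z_2$, assumes the logit form for G, and compares the analytical partial effects and elasticity with numerical derivatives. The coefficient values and evaluation point are hypothetical; only the formulas come from the text.

```python
import math

def G(z: float) -> float:
    """Logistic cdf."""
    return 1.0 / (1.0 + math.exp(-z))

def g(z: float) -> float:
    """Logistic density, equal to G(z)[1 - G(z)] = exp(z)/[1 + exp(z)]^2."""
    return G(z) * (1.0 - G(z))

B0, B1, B2, B3, B4 = 0.1, 0.5, -0.05, 0.3, -0.2  # hypothetical coefficients

def index(z1, z2, z3):
    return B0 + B1 * z1 + B2 * z1 ** 2 + B3 * math.log(z2) + B4 * z3

def prob(z1, z2, z3):
    return G(index(z1, z2, z3))

def effect_z1(z1, z2, z3):
    """Analytical partial effect of z1: g(.)(B1 + 2*B2*z1)."""
    return g(index(z1, z2, z3)) * (B1 + 2.0 * B2 * z1)

def effect_z2(z1, z2, z3):
    """Analytical partial effect of z2: g(.)(B3/z2)."""
    return g(index(z1, z2, z3)) * (B3 / z2)

def elasticity_z2(z1, z2, z3):
    """Analytical elasticity with respect to z2: B3*g(.)/G(.)."""
    idx = index(z1, z2, z3)
    return B3 * g(idx) / G(idx)

z = (2.0, 50.0, 1.0)  # hypothetical evaluation point
h = 1e-5              # step for central differences
num_z1 = (prob(z[0] + h, z[1], z[2]) - prob(z[0] - h, z[1], z[2])) / (2.0 * h)
num_z2 = (prob(z[0], z[1] + h, z[2]) - prob(z[0], z[1] - h, z[2])) / (2.0 * h)

print(f"effect of z1: analytical {effect_z1(*z):.6f}, numerical {num_z1:.6f}")
print(f"effect of z2: analytical {effect_z2(*z):.6f}, numerical {num_z2:.6f}")
print(f"elasticity wrt z2: analytical {elasticity_z2(*z):.6f}, "
      f"numerical {num_z2 * z[1] / prob(*z):.6f}")
```

The same finite-difference comparison is a convenient safeguard when a model contains interactions and the derivative is easy to get wrong by hand.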


17-1b Maximum Likelihood Estimation of Logit and Probit Models


How should we estimate nonlinear binary response models? To estimate the LPM, we can use ordi- 
nary least squares (see Section 7-5) or, in some cases, weighted least squares (see Section 8-5). 
Because of the nonlinear nature of E(y|x), OLS and WLS are not applicable. We could use nonlinear 
versions of these methods, but it is no more difficult to use maximum likelihood estimation (MLE) 
(see Appendix 17A for a brief discussion). Up until now, we have had little need for MLE, although 
we did note that, under the classical linear model assumptions, the OLS estimator is the maximum 
likelihood estimator (conditional on the explanatory variables). For estimating limited dependent var- 
iable models, maximum likelihood methods are indispensable. Because MLE is based on the distribu- 
tion of y given x, the heteroskedasticity in Var(y|x) is automatically accounted for. 

Assume that we have a random sample of size n. To obtain the maximum likelihood estimator,
conditional on the explanatory variables, we need the density of $y_i$ given $\mathbf{x}_i$. We can write this as

$f(y \mid \mathbf{x}_i; \boldsymbol{\beta}) = [G(\mathbf{x}_i\boldsymbol{\beta})]^{y}[1 - G(\mathbf{x}_i\boldsymbol{\beta})]^{1-y}$, $y = 0, 1$, [17.10]

where, for simplicity, we absorb the intercept into the vector $\mathbf{x}_i$. We can easily see that when $y = 1$,
we get $G(\mathbf{x}_i\boldsymbol{\beta})$ and when $y = 0$, we get $1 - G(\mathbf{x}_i\boldsymbol{\beta})$. The log-likelihood function for observation i is
a function of the parameters and the data $(\mathbf{x}_i, y_i)$ and is obtained by taking the log of (17.10):

$\ell_i(\boldsymbol{\beta}) = y_i\log[G(\mathbf{x}_i\boldsymbol{\beta})] + (1 - y_i)\log[1 - G(\mathbf{x}_i\boldsymbol{\beta})]$. [17.11]

Because $G(\cdot)$ is strictly between zero and one for logit and probit, $\ell_i(\boldsymbol{\beta})$ is well defined for all values
of $\boldsymbol{\beta}$.

The log-likelihood for a sample size of n is obtained by summing (17.11) across all observations:
$\mathcal{L}(\boldsymbol{\beta}) = \sum_{i=1}^{n}\ell_i(\boldsymbol{\beta})$. The MLE of $\boldsymbol{\beta}$, denoted by $\hat{\boldsymbol{\beta}}$, maximizes this log-likelihood. If $G(\cdot)$ is the standard
logit cdf, then $\hat{\boldsymbol{\beta}}$ is the logit estimator; if $G(\cdot)$ is the standard normal cdf, then $\hat{\boldsymbol{\beta}}$ is the probit estimator.

Because of the nonlinear nature of the maximization problem, we cannot write formulas for the
logit or probit maximum likelihood estimates. In addition to raising computational issues, this makes
the statistical theory for logit and probit much more difficult than OLS or even 2SLS. Nevertheless,
the general theory of MLE for random samples implies that, under very general conditions, the
MLE is consistent, asymptotically normal, and asymptotically efficient. [See Wooldridge (2010,
Chapter 13) for a general discussion.] We will just use the results here; applying logit and probit
models is fairly easy, provided we understand what the statistics mean.
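
To make the numerical nature of the problem concrete, here is a minimal sketch of logit estimation with an intercept and one explanatory variable. It sums (17.11) to get the log-likelihood and maximizes it by Newton's method, which is one common choice and is an assumption of this sketch, not necessarily what any particular package does. The ten observations are made up for illustration.

```python
import math

def logistic(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(b0, b1, xs, ys):
    """Sum of (17.11) over all observations for the logit model."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = logistic(b0 + b1 * x)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total

def fit_logit(xs, ys, tol=1e-10, max_iter=50):
    """Newton's method for (b0, b1), starting from zero."""
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        s0 = s1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = logistic(b0 + b1 * x)
            r, w = y - p, p * (1.0 - p)
            s0 += r            # score with respect to the intercept
            s1 += r * x        # score with respect to the slope
            h00 += w           # elements of the negative Hessian
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        d0 = (h11 * s0 - h01 * s1) / det   # Newton step: (negative Hessian)^(-1) * score
        d1 = (h00 * s1 - h01 * s0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1

# Made-up data: y tends to be one at larger x, but the outcomes overlap.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
ys = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

b0_hat, b1_hat = fit_logit(xs, ys)
print(f"b0_hat = {b0_hat:.4f}, b1_hat = {b1_hat:.4f}")
print(f"log-likelihood at the estimates: {log_likelihood(b0_hat, b1_hat, xs, ys):.4f}")
print(f"log-likelihood at (0, 0):        {log_likelihood(0.0, 0.0, xs, ys):.4f}")
```

The outcomes in the made-up data overlap in x; if a value of x perfectly separated the zeros from the ones, the log-likelihood would have no finite maximizer and the iterations would not settle down.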

Each $\hat{\beta}_j$ comes with an (asymptotic) standard error, the formula for which is complicated and
presented in the chapter appendix. Once we have the standard errors (and these are reported along with
the coefficient estimates by any package that supports logit and probit), we can construct
(asymptotic) t tests and confidence intervals, just as with OLS, 2SLS, and the other estimators we have
encountered. In particular, to test $H_0: \beta_j = 0$, we form the t statistic $\hat{\beta}_j/se(\hat{\beta}_j)$ and carry out the test in
the usual way, once we have decided on a one- or two-sided alternative.


17-1c Testing Multiple Hypotheses 


We can also test multiple restrictions in logit and probit models. In most cases, these are tests of mul- 
tiple exclusion restrictions, as in Section 4-5. We will focus on exclusion restrictions here. 

There are three ways to test exclusion restrictions for logit and probit models. The Lagrange
multiplier or score test only requires estimating the model under the null hypothesis, just as in the linear
case in Section 5-2; we will not cover the score test here, since it is rarely needed to test exclusion
restrictions. [See Wooldridge (2010, Chapter 15) for other uses of the score test in binary response
models.]

The Wald test requires estimation of only the unrestricted model. In the linear model case, the 
Wald statistic, after a simple transformation, is essentially the F statistic, so there is no need to 
cover the Wald statistic separately. The formula for the Wald statistic is given in Wooldridge (2010, 
Chapter 15). This statistic is computed by econometrics packages that allow exclusion restrictions to 
be tested after the unrestricted model has been estimated. It has an asymptotic chi-square distribution, 
with df equal to the number of restrictions being tested. 

If both the restricted and unrestricted models are easy to estimate, as is usually the case with
exclusion restrictions, then the likelihood ratio (LR) test becomes very attractive. The LR test is
based on the same concept as the F test in a linear model. The F test measures the increase in the sum
of squared residuals when variables are dropped from the model. The LR test is based on the difference in
the log-likelihood functions for the unrestricted and restricted models. The idea is this: Because the MLE
maximizes the log-likelihood function, dropping variables generally leads to a smaller, or at least no
larger, log-likelihood. (This is similar to the fact that the R-squared never increases when variables are
dropped from a regression.) The question is whether the fall in the log-likelihood is large enough to
conclude that the dropped variables are important. We can make this decision once we have a test statistic
and a set of critical values.

The likelihood ratio statistic is twice the difference in the log-likelihoods:

$LR = 2(\mathcal{L}_{ur} - \mathcal{L}_r)$, [17.12]

where $\mathcal{L}_{ur}$ is the log-likelihood value for the unrestricted model and $\mathcal{L}_r$ is the log-likelihood value for
the restricted model. Because $\mathcal{L}_{ur} \ge \mathcal{L}_r$, LR is nonnegative and usually strictly positive. In computing
the LR statistic for binary response models, it is important to know that the log-likelihood function is
always a negative number. This fact follows from equation (17.11), because $y_i$ is either zero or one
and both variables inside the log function are strictly between zero and one, which means their natural
logs are negative. That the log-likelihood functions are both negative does not change the way we
compute the LR statistic; we simply preserve the negative signs in equation (17.12).

GOING FURTHER 17.1

A probit model to explain whether a firm is taken over by another firm during a given year is

$P(takeover = 1 \mid \mathbf{x}) = \Phi(\beta_0 + \beta_1 avgprof + \beta_2 mktval + \beta_3 debtearn + \beta_4 ceoten + \beta_5 ceosal + \beta_6 ceoage)$,

where takeover is a binary response variable, avgprof is the firm’s average profit margin
over several prior years, mktval is market value of the firm, debtearn is the debt-to-earnings
ratio, and ceoten, ceosal, and ceoage are the tenure, annual salary, and age
of the chief executive officer, respectively. State the null hypothesis that, other factors
being equal, variables related to the CEO have no effect on the probability of takeover.
How many df are in the chi-square distribution for the LR or Wald test?

The multiplication by two in (17.12) is needed so that LR has an approximate chi-square
distribution under $H_0$. If we are testing q exclusion restrictions, $LR \overset{a}{\sim} \chi^2_q$. This means that, to test $H_0$ at the
5% level, we use as our critical value the 95th percentile in the $\chi^2_q$ distribution. Computing p-values is
easy with most software packages.
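
The LR calculation can be sketched as follows. The two log-likelihood values and the number of restrictions are hypothetical, and chi2_sf is an illustrative helper that computes the chi-square upper-tail probability from a standard series for the incomplete gamma function, so that the sketch needs only Python's standard library; a statistics package would normally supply this function.

```python
import math

def lr_statistic(loglik_ur: float, loglik_r: float) -> float:
    """Likelihood ratio statistic from (17.12); the negative signs are preserved."""
    return 2.0 * (loglik_ur - loglik_r)

def chi2_sf(x: float, df: int) -> float:
    """P(chi-square with df degrees of freedom > x), via a series for the lower incomplete gamma."""
    if x <= 0.0:
        return 1.0
    a, t = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = a
    for _ in range(10_000):
        n += 1.0
        term *= t / n
        total += term
        if term < total * 1e-15:
            break
    lower = total * math.exp(-t + a * math.log(t) - math.lgamma(a))
    return max(0.0, 1.0 - lower)

# Hypothetical log-likelihoods for an unrestricted model and a model dropping q = 3 variables.
loglik_ur, loglik_r, q = -401.77, -410.52, 3

lr = lr_statistic(loglik_ur, loglik_r)
p_value = chi2_sf(lr, q)
print(f"LR = {lr:.2f}, df = {q}, p-value = {p_value:.5f}")
```

With these made-up numbers the fall in the log-likelihood is large relative to a chi-square distribution with three degrees of freedom, so the exclusion restrictions would be rejected at the usual significance levels.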


17-1d Interpreting the Logit and Probit Estimates 


Given modern computers, from a practical perspective the most difficult aspect of logit or probit
models is presenting and interpreting the results. The coefficient estimates, their standard errors, and
the value of the log-likelihood function are reported by all software packages that do logit and probit,
and these should be reported in any application. The coefficients give the signs of the partial effects of
each $x_j$ on the response probability, and the statistical significance of $x_j$ is determined by whether we
can reject $H_0: \beta_j = 0$ at a sufficiently small significance level.

As we briefly discussed in Section 7-5 for the linear probability model, we can compute a
goodness-of-fit measure called the percent correctly predicted. As before, we define a binary
predictor of $y_i$ to be one if the predicted probability is at least .5, and zero otherwise. Mathematically,
$\tilde{y}_i = 1$ if $G(\hat{\beta}_0 + \mathbf{x}_i\hat{\boldsymbol{\beta}}) \ge .5$ and $\tilde{y}_i = 0$ if $G(\hat{\beta}_0 + \mathbf{x}_i\hat{\boldsymbol{\beta}}) < .5$. Given $\{\tilde{y}_i: i = 1, 2, \ldots, n\}$, we can see
how well $\tilde{y}_i$ predicts $y_i$ across all observations. There are four possible outcomes on each pair, $(y_i, \tilde{y}_i)$;
when both are zero or both are one, we make the correct prediction. In the two cases where one of the
pair is zero and the other is one, we make the incorrect prediction. The percentage correctly predicted
is the percentage of times that $\tilde{y}_i = y_i$.

Although the percentage correctly predicted is useful as a goodness-of-fit measure, it can be
misleading. In particular, it is possible to get rather high percentages correctly predicted even when the
least likely outcome is very poorly predicted. For example, suppose that n = 200, 160 observations
have $y_i = 0$, and, out of these 160 observations, 140 of the $\tilde{y}_i$ are also zero (so we correctly predict
87.5% of the zero outcomes). Even if none of the predictions is correct when $y_i = 1$, we still correctly
predict 70% of all outcomes (140/200 = .70). Often, we hope to have some ability to predict the
least likely outcome (such as whether someone is arrested for committing a crime), and so we should
be up front about how well we do in predicting each outcome. Therefore, it makes sense to also
compute the percentage correctly predicted for each of the outcomes. Problem 1 asks you to show that the
overall percentage correctly predicted is a weighted average of $\hat{q}_0$ (the percentage correctly predicted
for $y_i = 0$) and $\hat{q}_1$ (the percentage correctly predicted for $y_i = 1$), where the weights are the fractions
of zeros and ones in the sample, respectively.
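
The numerical example in the preceding paragraph can be reproduced directly. The sketch below builds outcome and prediction vectors that match the counts in the text (the ordering of observations is arbitrary) and computes the overall and outcome-specific fractions correctly predicted; the function name percent_correct is illustrative, and the code assumes both outcomes appear in the sample.

```python
def percent_correct(y, y_tilde):
    """Return (overall, q0, q1): fractions correctly predicted overall, for y = 0, and for y = 1."""
    pairs = list(zip(y, y_tilde))
    overall = sum(1 for a, b in pairs if a == b) / len(pairs)
    zeros = [(a, b) for a, b in pairs if a == 0]
    ones = [(a, b) for a, b in pairs if a == 1]
    q0 = sum(1 for a, b in zeros if a == b) / len(zeros)
    q1 = sum(1 for a, b in ones if a == b) / len(ones)
    return overall, q0, q1

# The example from the text: n = 200, 160 zeros of which 140 are predicted correctly,
# and none of the 40 ones predicted correctly.
y = [0] * 160 + [1] * 40
y_tilde = [0] * 140 + [1] * 20 + [0] * 40

overall, q0, q1 = percent_correct(y, y_tilde)
share_zeros = y.count(0) / len(y)
share_ones = y.count(1) / len(y)
weighted = share_zeros * q0 + share_ones * q1

print(f"overall = {overall:.3f}, q0 = {q0:.3f}, q1 = {q1:.3f}")
print(f"weighted average of q0 and q1 = {weighted:.3f}")
```

The overall figure of 70% coincides with the weighted average .8(.875) + .2(0), even though not a single success is predicted correctly.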

Some have criticized the prediction rule just described for using a threshold value of .5,
especially when one of the outcomes is unlikely. For example, if $\bar{y} = .08$ (only 8% “successes” in the
sample), it could be that we never predict $y_i = 1$ because the estimated probability of success is never
greater than .5. One alternative is to use the fraction of successes in the sample as the threshold, .08
in the previous example. In other words, define $\tilde{y}_i = 1$ when $G(\hat{\beta}_0 + \mathbf{x}_i\hat{\boldsymbol{\beta}}) \ge .08$, and zero
otherwise. Using this rule will certainly increase the number of predicted successes, but not without cost:
we will necessarily make more mistakes, perhaps many more, in predicting zeros (“failures”). In
terms of the overall percentage correctly predicted, we may do worse than using the .5 threshold.

A third possibility is to choose the threshold such that the fraction of $\tilde{y}_i = 1$ in the sample is the
same as (or very close to) $\bar{y}$. In other words, search over threshold values $\tau$, $0 < \tau < 1$, such that if
we define $\tilde{y}_i = 1$ when $G(\hat{\beta}_0 + \mathbf{x}_i\hat{\boldsymbol{\beta}}) \ge \tau$, then $\sum_{i=1}^{n}\tilde{y}_i \approx \sum_{i=1}^{n}y_i$. (The trial and error required to
find the desired value of $\tau$ can be tedious, but it is feasible. In some cases, it will not be possible to
make the number of predicted successes exactly the same as the number of successes in the sample.)
Now, given this set of $\tilde{y}_i$, we can compute the percentage correctly predicted for each of the two
outcomes as well as the overall percentage correctly predicted.
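
The search over thresholds is simple to automate. In the sketch below, the fitted probabilities and outcomes are made up, and match_threshold is an illustrative helper: it tries each distinct fitted probability as a candidate $\tau$ and keeps the one for which the number of predicted successes is closest to the number of actual successes.

```python
def match_threshold(p_hat, y):
    """Return (tau, gap): the candidate threshold whose predicted-success count is closest
    to the actual number of successes, and the remaining absolute difference in counts."""
    target = sum(y)
    best_tau, best_gap = None, None
    for tau in sorted(set(p_hat)):
        predicted = sum(1 for p in p_hat if p >= tau)
        gap = abs(predicted - target)
        if best_gap is None or gap < best_gap:
            best_tau, best_gap = tau, gap
    return best_tau, best_gap

# Made-up fitted probabilities and outcomes; 2 successes out of 10 observations.
p_hat = [0.02, 0.05, 0.05, 0.07, 0.10, 0.12, 0.20, 0.31, 0.40, 0.45]
y = [0, 0, 0, 0, 0, 1, 0, 0, 1, 0]

tau, gap = match_threshold(p_hat, y)
y_tilde = [1 if p >= tau else 0 for p in p_hat]

print(f"predicted successes with a .5 threshold: {sum(1 for p in p_hat if p >= 0.5)}")
print(f"chosen tau = {tau}, predicted successes = {sum(y_tilde)}, actual successes = {sum(y)}")
```

Because several observations can share the same fitted probability, the gap returned by the helper need not be zero, which is the caveat noted in the text.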
There are also various pseudo R-squared measures for binary response. McFadden (1974)
suggests the measure $1 - \mathcal{L}_{ur}/\mathcal{L}_0$, where $\mathcal{L}_{ur}$ is the log-likelihood function for the estimated model and
$\mathcal{L}_0$ is the log-likelihood function in the model with only an intercept. Why does this measure make
sense? Recall that the log-likelihoods are negative, and so $\mathcal{L}_{ur}/\mathcal{L}_0 = |\mathcal{L}_{ur}|/|\mathcal{L}_0|$. Further, $|\mathcal{L}_{ur}| \le |\mathcal{L}_0|$.
If the covariates have no explanatory power, then $\mathcal{L}_{ur}/\mathcal{L}_0 = 1$, and the pseudo R-squared is zero, just
as the usual R-squared is zero in a linear regression when the covariates have no explanatory power.
Usually, $|\mathcal{L}_{ur}| < |\mathcal{L}_0|$, in which case $1 - \mathcal{L}_{ur}/\mathcal{L}_0 > 0$. If $\mathcal{L}_{ur}$ were zero, the pseudo R-squared would equal
unity. In fact, $\mathcal{L}_{ur}$ cannot reach zero in a probit or logit model, as that would require the estimated
probabilities when $y_i = 1$ all to be unity and the estimated probabilities when $y_i = 0$ all to be zero.
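
McFadden's measure takes only a few lines to compute once fitted probabilities are available. The sketch below relies on the fact that, in a logit or probit model with only an intercept, the fitted probability for every observation is the sample fraction of successes. The outcomes and fitted probabilities are made up and are not the output of an actual estimation; the function names are illustrative.

```python
import math

def loglik(y, p_hat):
    """Log-likelihood: sum of log(p) when y = 1 and log(1 - p) when y = 0."""
    return sum(math.log(p) if yi == 1 else math.log(1.0 - p) for yi, p in zip(y, p_hat))

def loglik_intercept_only(y):
    """Log-likelihood of the intercept-only model, whose fitted probability is the sample mean of y."""
    n, n1 = len(y), sum(y)
    ybar = n1 / n
    return n1 * math.log(ybar) + (n - n1) * math.log(1.0 - ybar)

def mcfadden_r2(y, p_hat):
    return 1.0 - loglik(y, p_hat) / loglik_intercept_only(y)

# Made-up outcomes and fitted probabilities, for illustration only.
y = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]
p_hat = [0.10, 0.15, 0.20, 0.40, 0.45, 0.55, 0.60, 0.80, 0.85, 0.90]

r2 = mcfadden_r2(y, p_hat)
r2_no_power = mcfadden_r2(y, [sum(y) / len(y)] * len(y))
print(f"pseudo R-squared with informative fitted probabilities: {r2:.4f}")
print(f"pseudo R-squared when every fitted probability is ybar: {r2_no_power:.4f}")
```

When every fitted probability equals the sample mean of y, the two log-likelihoods coincide and the measure is zero, matching the no-explanatory-power case described above.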

Alternative pseudo R-squareds for probit and logit are more directly related to the usual R-squared
from OLS estimation of a linear probability model. For either probit or logit, let $\hat{y}_i = G(\hat{\beta}_0 + \mathbf{x}_i\hat{\boldsymbol{\beta}})$ be
the fitted probabilities. Since these probabilities are also estimates of $E(y_i \mid \mathbf{x}_i)$, we can base an R-squared
on how close the $\hat{y}_i$ are to the $y_i$. One possibility that suggests itself from standard regression analysis is
to compute the squared correlation between $y_i$ and $\hat{y}_i$. Remember, in a linear regression framework, this
is an algebraically equivalent way to obtain the usual R-squared; see equation (3.29). Therefore, we
can compute a pseudo R-squared for probit and logit that is directly comparable to the usual R-squared
from estimation of a linear probability model. In any case, goodness-of-fit is usually less important
than trying to obtain convincing estimates of the ceteris paribus effects of the explanatory variables.

Often, we want to estimate the effects of the $x_j$ on the response probabilities, $P(y = 1|\mathbf{x})$. If $x_j$ is
(roughly) continuous, then

$$\Delta\hat{P}(y = 1|\mathbf{x}) \approx [g(\hat{\beta}_0 + \mathbf{x}\hat{\boldsymbol{\beta}})\hat{\beta}_j]\Delta x_j,$$ [17.13]

for “small” changes in $x_j$. So, for $\Delta x_j = 1$, the change in the estimated success probability is roughly
$g(\hat{\beta}_0 + \mathbf{x}\hat{\boldsymbol{\beta}})\hat{\beta}_j$. Compared with the linear probability model, the cost of using probit and logit models
is that the partial effects in equation (17.13) are harder to summarize because the scale factor,
$g(\hat{\beta}_0 + \mathbf{x}\hat{\boldsymbol{\beta}})$, depends on $\mathbf{x}$ (that is, on all of the explanatory variables). One possibility is to plug in
interesting values for the $x_j$, such as means, medians, minimums, maximums, and lower and upper
quartiles, and then see how $g(\hat{\beta}_0 + \mathbf{x}\hat{\boldsymbol{\beta}})$ changes. Although attractive, this can be tedious and result
in too much information even if the number of explanatory variables is moderate.

As a quick summary for getting at the magnitudes of the partial effects, it is handy to have a single
scale factor that can be used to multiply each $\hat{\beta}_j$ (or at least those coefficients on roughly continuous
variables). One method, commonly used in econometrics packages that routinely estimate probit and logit
models, is to replace each explanatory variable with its sample average. In other words, the adjustment factor is

$$g(\hat{\beta}_0 + \bar{\mathbf{x}}\hat{\boldsymbol{\beta}}) = g(\hat{\beta}_0 + \hat{\beta}_1\bar{x}_1 + \hat{\beta}_2\bar{x}_2 + \cdots + \hat{\beta}_k\bar{x}_k),$$ [17.14]

where $g(\cdot)$ is the standard normal density in the probit case and $g(z) = \exp(z)/[1 + \exp(z)]^2$ in the
logit case. The idea behind (17.14) is that, when it is multiplied by $\hat{\beta}_j$, we obtain the partial effect of
$x_j$ for the “average” person in the sample. Thus, if we multiply a coefficient by (17.14), we generally
obtain the partial effect at the average (PEA).

There are at least two potential problems with using PEAs to summarize the partial effects of the 
explanatory variables. First, if some of the explanatory variables are discrete, the averages of them
represent no one in the sample (or population, for that matter). For example, if $x_1 = female$ and 47.5% of
the sample is female, what sense does it make to plug in $\bar{x}_1 = .475$ to represent the “average” person?
Second, if a continuous explanatory variable appears as a nonlinear function (say, as a natural log or
in a quadratic), it is not clear whether we want to average the nonlinear function or plug the average
into the nonlinear function. For example, should we use $\overline{\log(sales)}$ or $\log(\overline{sales})$ to represent average
firm size? Econometrics packages that compute the scale factor in (17.14) default to the former: the
software is written to compute the averages of the regressors included in the probit or logit estimation. 

A different approach to computing a scale factor circumvents the issue of which values to plug 
in for the explanatory variables. Instead, the second scale factor results from averaging the individual 
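partial effects across the sample, leading to what is called the average partial effect (APE) or,
sometimes, the average marginal effect (AME). For a continuous explanatory variable $x_j$, the average
partial effect is $n^{-1}\sum_{i=1}^{n}[g(\hat{\beta}_0 + \mathbf{x}_i\hat{\boldsymbol{\beta}})\hat{\beta}_j] = [n^{-1}\sum_{i=1}^{n} g(\hat{\beta}_0 + \mathbf{x}_i\hat{\boldsymbol{\beta}})]\hat{\beta}_j$. The term multiplying $\hat{\beta}_j$ acts
as a scale factor:

$$n^{-1}\sum_{i=1}^{n} g(\hat{\beta}_0 + \mathbf{x}_i\hat{\boldsymbol{\beta}}).$$ [17.15]

Equation (17.15) is easily computed after probit or logit estimation, where $g(\hat{\beta}_0 + \mathbf{x}_i\hat{\boldsymbol{\beta}}) = \phi(\hat{\beta}_0 + \mathbf{x}_i\hat{\boldsymbol{\beta}})$
in the probit case and $g(\hat{\beta}_0 + \mathbf{x}_i\hat{\boldsymbol{\beta}}) = \exp(\hat{\beta}_0 + \mathbf{x}_i\hat{\boldsymbol{\beta}})/[1 + \exp(\hat{\beta}_0 + \mathbf{x}_i\hat{\boldsymbol{\beta}})]^2$ in the logit case. The
two scale factors differ, and are possibly quite different, because in (17.15) we are using the average
of the nonlinear function rather than the nonlinear function of the average [as in (17.14)].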
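
To see how the two scale factors can differ, here is a minimal sketch for a logit model with a single regressor. The coefficient values and the data are invented for illustration; only the formulas (17.14) and (17.15) come from the text.

```python
import math


def logit_density(z):
    """g(z) = exp(z) / [1 + exp(z)]^2, the derivative of the logistic cdf."""
    return math.exp(z) / (1 + math.exp(z)) ** 2


# Hypothetical estimates and regressor values (not from any real data set).
beta0_hat, beta1_hat = -1.0, 0.8
x = [0.0, 0.5, 1.0, 2.0, 4.0]

x_bar = sum(x) / len(x)

# (17.14): the nonlinear function evaluated at the average.
pea_scale = logit_density(beta0_hat + beta1_hat * x_bar)

# (17.15): the average of the nonlinear function.
ape_scale = sum(logit_density(beta0_hat + beta1_hat * xi) for xi in x) / len(x)

print(f"PEA scale factor: {pea_scale:.4f}")
print(f"APE scale factor: {ape_scale:.4f}")
print(f"PEA of x: {pea_scale * beta1_hat:.4f}, APE of x: {ape_scale * beta1_hat:.4f}")
```

With these made-up numbers the PEA factor is noticeably larger than the APE factor, because the average index sits near zero, where the logistic density peaks, while several individual observations lie in the tails.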

Because both of the scale factors just described depend on the calculus approximation in (17.13), 
neither makes much sense for discrete explanatory variables. Instead, it is better to use equation (17.9) 
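to directly estimate the change in the probability. For a change in $x_k$ from $c_k$ to $c_k + 1$, the discrete
analog of the partial effect based on (17.14) is

$$G[\hat{\beta}_0 + \hat{\beta}_1\bar{x}_1 + \cdots + \hat{\beta}_{k-1}\bar{x}_{k-1} + \hat{\beta}_k(c_k + 1)] - G(\hat{\beta}_0 + \hat{\beta}_1\bar{x}_1 + \cdots + \hat{\beta}_{k-1}\bar{x}_{k-1} + \hat{\beta}_k c_k),$$ [17.16]

where $G$ is the standard normal cdf in the probit case and $G(z) = \exp(z)/[1 + \exp(z)]$ in the logit
case. The average partial effect, which usually is more comparable to LPM estimates, is

$$n^{-1}\sum_{i=1}^{n}\{G[\hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \cdots + \hat{\beta}_{k-1}x_{i,k-1} + \hat{\beta}_k(c_k + 1)] - G(\hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \cdots + \hat{\beta}_{k-1}x_{i,k-1} + \hat{\beta}_k c_k)\}.$$ [17.17]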


The quantity in equation (17.17) is a “partial” effect because all explanatory variables other than
$x_k$ are being held fixed at their observed values. It is not necessarily a “marginal” effect because the
change in $x_k$ from $c_k$ to $c_k + 1$ may not be a “marginal” (or “small”) increase; whether it is depends
on the definition of $x_k$. Obtaining expression (17.17) for either probit or logit is actually rather simple.
First, for each observation, we estimate the probability of success for the two chosen values of $x_k$, plugging
in the actual outcomes for the other explanatory variables. (So, we would have $n$ estimated differences.)
Then, we average the differences in estimated probabilities across all observations. For binary
$x_k$, both (17.16) and (17.17) are easily computed using certain econometrics packages, such as Stata®.
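
The two-step recipe for (17.17) can be written out directly. The sketch below assumes a probit model with one continuous regressor and one binary regressor $x_k$; the coefficient values and data are hypothetical, and $c_k = 0$ so the comparison is between $x_k = 1$ and $x_k = 0$.

```python
from statistics import NormalDist

Phi = NormalDist().cdf  # standard normal cdf, G(.) in the probit case

# Hypothetical probit estimates: intercept, slope on x1, slope on binary xk.
b0, b1, bk = -0.5, 0.3, 0.7
x1 = [0.0, 1.0, 2.0, 3.0]  # observed values of the other regressor

# Step 1: for each observation, the difference in estimated probabilities
# between xk = 1 and xk = 0, holding x1 at its observed value.
diffs = [Phi(b0 + b1 * v + bk) - Phi(b0 + b1 * v) for v in x1]

# Step 2: average the n differences, as in (17.17).
ape = sum(diffs) / len(diffs)

# For comparison, the analog of (17.16): plug in the sample average of x1.
x1_bar = sum(x1) / len(x1)
effect_at_average = Phi(b0 + b1 * x1_bar + bk) - Phi(b0 + b1 * x1_bar)

print([round(d, 3) for d in diffs])
print(round(ape, 3), round(effect_at_average, 3))
```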

The expression in (17.17) has a particularly useful interpretation when $x_k$ is a binary variable. For each
unit $i$, we estimate the predicted difference in the probability that $y_i = 1$ when $x_k = 1$ and $x_k = 0$, namely,

$$G(\hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \cdots + \hat{\beta}_{k-1}x_{i,k-1} + \hat{\beta}_k) - G(\hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \cdots + \hat{\beta}_{k-1}x_{i,k-1}).$$

For each $i$, this difference is the estimated effect of switching $x_k$ from zero to one, whether unit $i$
had $x_k = 1$ or $x_k = 0$. For example, if $y$ is an employment indicator (equal to one if the person is
employed) after participation in a job training program, indicated by $x_k$, then we can estimate the difference
in employment probabilities for each person in both states of the world. This is precisely the
counterfactual that arises in the potential outcomes framework discussed in Sections 3-7e, 4-7, and
7-6a. In the current case, the underlying potential outcomes, $y(0)$ and $y(1)$, are both binary. Rather than
use regression adjustment in a linear model (that is, using a linear probability model), we can instead
use a logit model or a probit model. If $x_k$ is an indicator for participating in a job training program,
and y is an employment indicator, then we estimate the employment probability for every individual 
in both states of the world—even though we only see them in one state (participate in the program or 
not). When we average the difference in the estimated probabilities across individuals, we obtain the 
average treatment effect. Using a model such as logit or probit makes it harder to obtain an average 
treatment effect, and especially its standard error, compared with linear regression. However, many 
econometrics software packages now have routines specifically to handle such cases. Problem 8 at the 
end of this chapter explores the potential outcomes framework with nonlinear models in more detail. 

As another example of where we can use equation (17.17), with $c_k = 0$, to study an important policy
question, suppose that $y$ indicates whether a family was approved for a mortgage, and $x_k$ is a binary
race indicator (equal to one for nonwhites). After estimating a logit or probit model that includes the
race indicator, along with control variables, we can, for each family, compute the estimated differences
in the probabilities of being approved for the loan under the two scenarios that the household
head is nonwhite versus white. By including income, wealth, credit rating, features of the loan, and so
on (which would be elements of $x_{i1}, x_{i2}, \ldots, x_{i,k-1}$), hopefully we can control for enough factors so that
averaging the differences in probabilities results in a convincing estimate of the race effect.

In applications where one applies probit, logit, and the LPM, it makes sense to compute the scale 
factors described above for probit and logit in making comparisons of partial effects. Still, sometimes 
one wants a quicker way to compare magnitudes of the different estimates. As mentioned earlier, for 
probit $g(0) \approx .4$ and for logit, $g(0) = .25$. Thus, to make the magnitudes of probit and logit roughly
comparable, we can multiply the probit coefficients by .4/.25 = 1.6, or we can multiply the logit
estimates by .625. In the LPM, $g(0)$ is effectively one, so the logit slope estimates can be divided by
four to make them comparable to the LPM estimates; the probit slope estimates can be divided by
2.5 to make them comparable to the LPM estimates. Still, in most cases, we want the more accurate
comparisons obtained by using the scale factors in (17.15) for logit and probit. For binary explanatory
variables, we use (17.17) for logit and probit.
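
The rule-of-thumb constants come straight from evaluating each density at zero, which the following short sketch verifies using only the Python standard library.

```python
import math
from statistics import NormalDist

probit_g0 = NormalDist().pdf(0.0)                 # standard normal density at 0
logit_g0 = math.exp(0) / (1 + math.exp(0)) ** 2   # logistic density at 0

print(f"probit g(0) = {probit_g0:.4f}")              # roughly .4
print(f"logit  g(0) = {logit_g0:.4f}")               # exactly .25
print(f"probit-to-logit factor = {probit_g0 / logit_g0:.3f}")   # roughly 1.6
print(f"logit-to-probit factor = {logit_g0 / probit_g0:.3f}")   # roughly .625
print(f"probit-to-LPM divisor  = {1 / probit_g0:.3f}")          # roughly 2.5
print(f"logit-to-LPM divisor   = {1 / logit_g0:.3f}")           # exactly 4
```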


Married Women’s Labor Force Participation 


We now use the data on 753 married women in MROZ to estimate the labor force participation model 
from Example 8.8—see also Section 7-5—by logit and probit. We also report the linear probability 
model estimates from Example 8.8, using the heteroskedasticity-robust standard errors. The results, 
with standard errors in parentheses, are given in Table 17.1. 


TABLE 17.1 LPM, Logit, and Probit Estimates of Labor Force Participation

Dependent Variable: inlf

Independent Variables            LPM (OLS)    Logit (MLE)    Probit (MLE)
nwifeinc                         −.0034       −.021          −.012
                                 (.0015)      (.008)         (.005)
educ                             .038         .221           .131
                                 (.007)       (.043)         (.025)
exper                            .039         .206           .123
                                 (.006)       (.032)         (.019)
exper^2                          −.00060      −.0032         −.0019
                                 (.00019)     (.0010)        (.0006)
age                              −.016        −.088          −.053
                                 (.002)       (.015)         (.008)
kidslt6                          −.262        −1.443         −.868
                                 (.032)       (.204)         (.119)
kidsge6                          .013         .060           .036
                                 (.014)       (.075)         (.043)
constant                         .586         .425           .270
                                 (.152)       (.860)         (.509)
Percentage correctly predicted   73.4         73.6           73.4
Log-likelihood value             n/a          −401.77        −401.30
Pseudo R-squared                 .264         .220           .221

GOING FURTHER 17.2
Using the probit estimates and the calculus approximation, what is the approximate
change in the response probability when exper increases from 10 to 11?

The estimates from the three models tell a consistent story. The signs of the coefficients are
the same across models, and the same variables are statistically significant in each model. The pseudo
R-squared for the LPM is just the usual R-squared reported for OLS; for logit and probit, the pseudo
R-squared is the measure based on the log-likelihoods described earlier.

As we have already emphasized, the magnitudes of the coefficient estimates across models are
not directly comparable. Instead, we compute the scale factors in equations (17.14) and (17.15). If
we evaluate the standard normal pdf $\phi(\hat{\beta}_0 + \hat{\beta}_1\bar{x}_1 + \hat{\beta}_2\bar{x}_2 + \cdots + \hat{\beta}_k\bar{x}_k)$ at the sample averages of
the explanatory variables (including the average of $exper^2$, kidslt6, and kidsge6), the result is approximately
.391. When we compute (17.14) for the logit case, we obtain about .243. The ratio of these,
.391/.243 ≈ 1.61, is very close to the simple rule of thumb for scaling up the probit estimates to make
them comparable to the logit estimates: multiply the probit estimates by 1.6. Nevertheless, for comparing
probit and logit to the LPM estimates, it is better to use (17.15). These scale factors are about .301
(probit) and .179 (logit). For example, the scaled logit coefficient on educ is about .179(.221) ≈ .040,
and the scaled probit coefficient on educ is about .301(.131) ≈ .039; both are remarkably close to
the LPM estimate of .038. Even on the discrete variable kidslt6, the scaled logit and probit coefficients
are similar to the LPM coefficient of −.262. These are .179(−1.443) ≈ −.258 (logit) and
.301(−.868) ≈ −.261 (probit).
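
The scaling arithmetic in this paragraph is easy to reproduce. The sketch below simply multiplies the Table 17.1 coefficients by the APE scale factors quoted in the text (.179 for logit, .301 for probit) and prints them next to the LPM coefficients; the dictionary layout is just one convenient way to organize the numbers.

```python
# Scale factors from (17.15) and coefficients from Table 17.1, as quoted in the text.
ape_scale = {"logit": 0.179, "probit": 0.301}
coef = {
    "logit": {"educ": 0.221, "kidslt6": -1.443},
    "probit": {"educ": 0.131, "kidslt6": -0.868},
}
lpm = {"educ": 0.038, "kidslt6": -0.262}

# Scaled coefficient = scale factor times the raw logit or probit coefficient.
scaled = {
    model: {var: ape_scale[model] * b for var, b in by_var.items()}
    for model, by_var in coef.items()
}

for var, lpm_coef in lpm.items():
    print(f"{var}: LPM {lpm_coef:.3f}, "
          f"scaled logit {scaled['logit'][var]:.3f}, "
          f"scaled probit {scaled['probit'][var]:.3f}")
```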

Table 17.2 reports the average partial effects for all explanatory variables and for each of the 
three estimated models. We obtained the estimates and standard errors from the statistical package 
Stata® 13. These APEs treat all explanatory variables as continuous, even the variables for the num- 
ber of children. Obtaining the APE for exper requires some care, as it must account for the quadratic 
functional form in exper. Even for the linear model we must compute the derivative and then find the 
average. In the LPM column, the APE of exper is the average of the derivative with respect to exper,
so $.039 - .0012\,exper_i$ averaged across all $i$. (The remaining APE entries for the LPM column are
simply the OLS coefficients in Table 17.1.) The APEs for exper for the logit and probit models also 
account for the quadratic in exper. As is clear from the table, the APEs, and their statistical signifi- 
cance, are very similar for all explanatory variables across all three models. 

TABLE 17.2 Average Partial Effects for the Labor Force Participation Models

Independent Variables    LPM        Logit      Probit
nwifeinc                 −.0034     −.0038     −.0036
                         (.0015)    (.0015)    (.0014)
educ                     .038       .039       .039
                         (.007)     (.007)     (.007)
exper                    .027       .025       .026
                         (.002)     (.002)     (.002)
age                      −.016      −.016      −.016
                         (.002)     (.002)     (.002)
kidslt6                  −.262      −.258      −.261
                         (.032)     (.032)     (.032)
kidsge6                  .013       .011       .011
                         (.014)     (.013)     (.013)

The biggest difference between the LPM and the logit and probit models is that the LPM
assumes constant marginal effects for educ, kidslt6, and so on, while the logit and probit models
imply diminishing magnitudes of the partial effects. In the LPM, one more small child is estimated
to reduce the probability of labor force participation by about .262, regardless of how many young 
children the woman already has (and regardless of the levels of the other explanatory variables). We 
can contrast this with the estimated marginal effect from probit. For concreteness, take a woman 
with nwifeinc = 20.13, educ = 12.3, exper = 10.6, and age = 42.5—which are roughly the sample 
averages—and kidsge6 = 1. What is the estimated decrease in the probability of working in going 
from zero to one small child? We evaluate the standard normal cdf, $\Phi(\hat{\beta}_0 + \hat{\beta}_1 x_1 + \cdots + \hat{\beta}_k x_k)$, with
kidslt6 = 1 and kidslt6 = 0, and the other independent variables set at the preceding values. We get
roughly .373 — .707 = —.334, which means that the labor force participation probability is about 
.334 lower when a woman has one young child. If the woman goes from one to two young chil- 
dren, the probability falls even more, but the marginal effect is not as large: .117 — .373 = —.256. 
Interestingly, the estimate from the linear probability model, which is supposed to estimate the effect 
near the average, is in fact between these two estimates. (Note that the calculations provided here, 
which use coefficients mostly rounded to the third decimal place, will differ somewhat from calcula- 
tions obtained within a statistical package—which would be subject to less rounding error.) 
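
The calculation above can be reproduced approximately with the rounded probit coefficients from Table 17.1. In this sketch the variable name `expersq` stands for $exper^2$ and is an assumed label; because the coefficients are rounded, the probabilities will differ from those quoted in the text in the second or third decimal place, but the pattern of a shrinking effect is the same.

```python
from statistics import NormalDist

Phi = NormalDist().cdf

# Probit coefficients as printed (rounded) in Table 17.1.
coef = {"const": 0.270, "nwifeinc": -0.012, "educ": 0.131, "exper": 0.123,
        "expersq": -0.0019, "age": -0.053, "kidslt6": -0.868, "kidsge6": 0.036}


def prob_inlf(kidslt6):
    """Estimated P(inlf = 1) for the woman described in the text."""
    x = {"const": 1.0, "nwifeinc": 20.13, "educ": 12.3, "exper": 10.6,
         "expersq": 10.6 ** 2, "age": 42.5, "kidslt6": kidslt6, "kidsge6": 1.0}
    return Phi(sum(coef[name] * x[name] for name in coef))


p0, p1, p2 = prob_inlf(0), prob_inlf(1), prob_inlf(2)
print(f"zero to one young child: {p1 - p0:.3f}")
print(f"one to two young children: {p2 - p1:.3f}")
print("LPM estimate for comparison: -0.262")
```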


Figure 17.2 illustrates how the estimated response probabilities from nonlinear binary response 
models can differ from the linear probability model. The estimated probability of labor force par- 
ticipation is graphed against years of education for the linear probability model and the probit 
model. (The graph for the logit model is very similar to that for the probit model.) In both cases, the 
explanatory variables, other than educ, are set at their sample averages. In particular, the two equations
graphed are $\widehat{inlf} = .102 + .038\,educ$ for the linear model and $\widehat{inlf} = \Phi(-1.403 + .131\,educ)$.
At lower levels of education, the linear probability model estimates higher labor force participation 
probabilities than the probit model. For example, at eight years of education, the linear probability 
model estimates a .406 labor force participation probability while the probit model estimates about .361.
The estimates are the same at around $11\tfrac{1}{3}$ years of education. At higher levels of education, the probit
model gives higher labor force participation probabilities. In this sample, the smallest years of education
is 5 and the largest is 17, so we really should not make comparisons outside this range.

FIGURE 17.2 Estimated response probabilities with respect to education for the linear probability and probit models.
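
Since the figure itself is not reproduced here, the two graphed equations can be tabulated instead. The sketch below evaluates both at a few education levels within the sample range, including the approximate crossing point; the function names are hypothetical.

```python
from statistics import NormalDist

Phi = NormalDist().cdf


def lpm_prob(educ):
    """Linear probability model line from the text."""
    return 0.102 + 0.038 * educ


def probit_prob(educ):
    """Probit response probability from the text."""
    return Phi(-1.403 + 0.131 * educ)


for educ in (5, 8, 11 + 1 / 3, 14, 17):
    print(f"educ = {educ:5.2f}: LPM {lpm_prob(educ):.3f}, probit {probit_prob(educ):.3f}")
```

Below the crossing point the LPM line lies above the probit curve, and above it the ordering reverses, as described in the text.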

The same issues concerning endogenous explanatory variables in linear models also arise in 
logit and probit models. We do not have the space to cover them, but it is possible to test and cor- 
rect for endogenous explanatory variables using methods related to two stage least squares. Evans 
and Schwab (1995) estimated a probit model for whether a student attends college, where the key 
explanatory variable is a dummy variable for whether the student attends a Catholic school. Evans 
and Schwab estimated a model by maximum likelihood that allows attending a Catholic school to be 
endogenous. [See Wooldridge (2010, Chapter 15) for an explanation of these methods.]

Two other issues have received attention in the context of probit models. The first is nonnormal- 
ity of e in the latent variable model (17.6). Naturally, if e does not have a standard normal distribution, 
the response probability will not have the probit form. Some authors tend to emphasize the inconsistency
in estimating the $\beta_j$, but this is the wrong focus unless we are only interested in the direction of
the effects. Because the response probability is unknown, we could not estimate the magnitude of
partial effects even if we had consistent estimates of the $\beta_j$.

A second specification problem, also defined in terms of the latent variable model, is heteroskedasticity
in $e$. If $\mathrm{Var}(e|\mathbf{x})$ depends on $\mathbf{x}$, the response probability no longer has the form $G(\beta_0 + \mathbf{x}\boldsymbol{\beta})$;
instead, it depends on the form of the variance and requires more general estimation. Such models 
are not often used in practice, since logit and probit with flexible functional forms in the independent 
variables tend to work well. 

Binary response models apply with little modification to independently pooled cross sections or 
to other data sets where the observations are independent but not necessarily identically distributed. 
Often, year or other time period dummy variables are included to account for aggregate time effects. 
Just as with linear models, logit and probit can be used to evaluate the impact of certain policies in the 
context of a natural experiment. 

The linear probability model can be applied with panel data; typically, it would be estimated by 
fixed effects (see Chapter 14). Logit and probit models with unobserved effects have recently become 
popular. These models are complicated by the nonlinear nature of the response probabilities, and they 
are more difficult to estimate and interpret. [See Wooldridge (2010, Chapter 15).] 


17-2 The Tobit Model for Corner Solution Responses 


As mentioned in the chapter introduction, another important kind of limited dependent variable is 
a corner solution response. Such a variable is zero for a nontrivial fraction of the population but is 
roughly continuously distributed over positive values. An example is the amount an individual spends 
on alcohol in a given month. In the population of people over age 21 in the United States, this variable 
takes on a wide range of values. For some significant fraction, the amount spent on alcohol is zero. 
The following treatment omits verification of some details concerning the Tobit model. [These are 
given in Wooldridge (2010, Chapter 17).] 

Let y be a variable that is essentially continuous over strictly positive values but that takes on a 
value of zero with positive probability. Nothing prevents us from using a linear model for y. In fact, 
a linear model might be a good approximation to $E(y|x_1, x_2, \ldots, x_k)$, especially for $x_j$ near the mean
values. But we would possibly obtain negative fitted values, which leads to negative predictions for y; 
this is analogous to the problems with the LPM for binary outcomes. Also, the assumption that an 
explanatory variable appearing in level form has a constant partial effect on E(y|x) can be misleading. 
Probably, Var(y|x) would be heteroskedastic, although we can easily deal with general heteroskedas- 
ticity by computing robust standard errors and test statistics. Because the distribution of y piles up at 
zero, y clearly cannot have a conditional normal distribution. So all inference would have only asymp- 
totic justification, as with the linear probability model. 

In some cases, it is important to have a model that implies nonnegative predicted values for $y$, and
which has sensible partial effects over a wide range of the explanatory variables. Plus, we sometimes
want to estimate features of the distribution of $y$ given $x_1, \ldots, x_k$ other than the conditional expectation.
The Tobit model is quite convenient for these purposes. Typically, the Tobit model expresses the
observed response, $y$, in terms of an underlying latent variable:

$$y^* = \beta_0 + \mathbf{x}\boldsymbol{\beta} + u, \quad u|\mathbf{x} \sim \mathrm{Normal}(0, \sigma^2)$$ [17.18]

$$y = \max(0, y^*).$$ [17.19]

The latent variable $y^*$ satisfies the classical linear model assumptions; in particular, it has a normal,
homoskedastic distribution with a linear conditional mean. Equation (17.19) implies that the observed
variable, $y$, equals $y^*$ when $y^* \ge 0$, but $y = 0$ when $y^* < 0$. Because $y^*$ is normally distributed, $y$ has a
continuous distribution over strictly positive values. In particular, the density of $y$ given $\mathbf{x}$ is the same
as the density of $y^*$ given $\mathbf{x}$ for positive values. Further,


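$$P(y = 0|\mathbf{x}) = P(y^* < 0|\mathbf{x}) = P(u < -\mathbf{x}\boldsymbol{\beta}|\mathbf{x}) = P(u/\sigma < -\mathbf{x}\boldsymbol{\beta}/\sigma|\mathbf{x}) = \Phi(-\mathbf{x}\boldsymbol{\beta}/\sigma) = 1 - \Phi(\mathbf{x}\boldsymbol{\beta}/\sigma),$$

because $u/\sigma$ has a standard normal distribution and is independent of $\mathbf{x}$; we have absorbed the intercept
into $\mathbf{x}$ for notational simplicity. Therefore, if $(\mathbf{x}_i, y_i)$ is a random draw from the population, the
density of $y_i$ given $\mathbf{x}_i$ is

$$(2\pi\sigma^2)^{-1/2}\exp[-(y - \mathbf{x}_i\boldsymbol{\beta})^2/(2\sigma^2)] = (1/\sigma)\phi[(y - \mathbf{x}_i\boldsymbol{\beta})/\sigma], \quad y > 0$$ [17.20]

$$P(y_i = 0|\mathbf{x}_i) = 1 - \Phi(\mathbf{x}_i\boldsymbol{\beta}/\sigma),$$ [17.21]

where $\phi$ is the standard normal density function.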
From (17.20) and (17.21), we can obtain the log-likelihood function for each observation i: 


$$\ell_i(\boldsymbol{\beta}, \sigma) = 1(y_i = 0)\log[1 - \Phi(\mathbf{x}_i\boldsymbol{\beta}/\sigma)] + 1(y_i > 0)\log\{(1/\sigma)\phi[(y_i - \mathbf{x}_i\boldsymbol{\beta})/\sigma]\};$$ [17.22]

notice how this depends on $\sigma$, the standard deviation of $u$, as well as on the $\beta_j$. The log-likelihood
for a random sample of size $n$ is obtained by summing (17.22) across all $i$. The maximum likelihood
estimates of $\boldsymbol{\beta}$ and $\sigma$ are obtained by maximizing the log-likelihood; this requires numerical methods,
although in most cases this is easily done using a packaged routine.

As in the case of logit and probit, each Tobit estimate comes with a standard error, and these can be
used to construct $t$ statistics for each $\hat{\beta}_j$; the matrix formula used to find the standard errors is
complicated and will not be presented here. [See, for example, Wooldridge (2010, Chapter 17).]

Testing multiple exclusion restrictions is easily done using the Wald test or the likelihood ratio test.
The Wald test has a form similar to that of the logit or probit case; the LR test is always given by (17.12),
where, of course, we use the Tobit log-likelihood functions for the restricted and unrestricted models.

GOING FURTHER 17.3
Let y be the number of extramarital affairs for a married woman from the U.S. population; we would
like to explain this variable in terms of other characteristics of the woman (in particular, whether she
works outside of the home), her husband, and her family. Is this a good candidate for a Tobit model?
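
To make the Tobit log-likelihood in (17.22) concrete, the following sketch evaluates it for given parameter values. It is not a full estimator: maximizing the function would require a numerical optimizer, which is omitted here. The data and parameter values are invented, and the first column of each row of `X` is taken to be the intercept.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def tobit_loglik(beta, sigma, X, y):
    """Sum over i of the observation-level log-likelihood in (17.22)."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        xb = sum(b * v for b, v in zip(beta, x_i))
        if y_i == 0:
            # Corner observations contribute log P(y_i = 0 | x_i).
            total += math.log(1 - STD_NORMAL.cdf(xb / sigma))
        else:
            # Positive observations contribute the log normal density of y_i.
            total += math.log(STD_NORMAL.pdf((y_i - xb) / sigma) / sigma)
    return total


# Hypothetical data: each row is (intercept, x1); two corner outcomes, two positive.
X = [(1.0, 0.5), (1.0, 1.5), (1.0, 2.5), (1.0, 3.5)]
y = [0.0, 0.0, 1.2, 2.9]

value = tobit_loglik(beta=(-1.0, 1.0), sigma=1.0, X=X, y=y)
print(round(value, 4))
```

In practice one would pass the negative of this function to a numerical optimizer, or simply rely on a packaged Tobit routine as the text suggests.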


17-2a Interpreting the Tobit Estimates 


Using modern computers, it is usually not much more difficult to obtain the maximum likelihood esti- 
mates for Tobit models than the OLS estimates of a linear model. Further, the outputs from Tobit and 
OLS are often similar. This makes it tempting to interpret the $\hat{\beta}_j$ from Tobit as if these were estimates
from a linear regression. Unfortunately, things are not so easy. 

From equation (17.18), we see that the $\beta_j$ measure the partial effects of the $x_j$ on $E(y^*|\mathbf{x})$, where
$y^*$ is the latent variable. Sometimes, $y^*$ has an interesting economic meaning, but more often it does
not. The variable we want to explain is y, as this is the observed outcome (such as hours worked or 
amount of charitable contributions). For example, as a policy matter, we are interested in the sensitiv- 
ity of hours worked to changes in marginal tax rates. 

We can estimate $P(y = 0|\mathbf{x})$ from (17.21), which, of course, allows us to estimate $P(y > 0|\mathbf{x})$.
What happens if we want to estimate the expected value of $y$ as a function of $\mathbf{x}$? In Tobit models,
two expectations are of particular interest: $E(y|y > 0, \mathbf{x})$, which is sometimes called the “conditional
expectation” because it is conditional on $y > 0$, and $E(y|\mathbf{x})$, which is, unfortunately, called the
“unconditional expectation.” (Both expectations are conditional on the explanatory variables.) The
expectation $E(y|y > 0, \mathbf{x})$ tells us, for given values of $\mathbf{x}$, the expected value of $y$ for the subpopulation
where $y$ is positive. Given $E(y|y > 0, \mathbf{x})$, we can easily find $E(y|\mathbf{x})$:

$$E(y|\mathbf{x}) = P(y > 0|\mathbf{x}) \cdot E(y|y > 0, \mathbf{x}) = \Phi(\mathbf{x}\boldsymbol{\beta}/\sigma) \cdot E(y|y > 0, \mathbf{x}).$$ [17.23]

To obtain $E(y|y > 0, \mathbf{x})$, we use a result for normally distributed random variables: if
$z \sim \mathrm{Normal}(0, 1)$, then $E(z|z > c) = \phi(c)/[1 - \Phi(c)]$ for any constant $c$. But $E(y|y > 0, \mathbf{x}) =
\mathbf{x}\boldsymbol{\beta} + E(u|u > -\mathbf{x}\boldsymbol{\beta}) = \mathbf{x}\boldsymbol{\beta} + \sigma E[(u/\sigma)|(u/\sigma) > -\mathbf{x}\boldsymbol{\beta}/\sigma] = \mathbf{x}\boldsymbol{\beta} + \sigma\phi(\mathbf{x}\boldsymbol{\beta}/\sigma)/\Phi(\mathbf{x}\boldsymbol{\beta}/\sigma)$, because
$\phi(-c) = \phi(c)$, $1 - \Phi(-c) = \Phi(c)$, and $u/\sigma$ has a standard normal distribution independent of $\mathbf{x}$.

We can summarize this as 


$$E(y|y > 0, \mathbf{x}) = \mathbf{x}\boldsymbol{\beta} + \sigma\lambda(\mathbf{x}\boldsymbol{\beta}/\sigma),$$ [17.24]


where $\lambda(c) = \phi(c)/\Phi(c)$ is called the inverse Mills ratio; it is the ratio between the standard normal
pdf and standard normal cdf, each evaluated at c. 

Equation (17.24) is important. It shows that the expected value of y conditional on y > 0 is equal 
to $\mathbf{x}\boldsymbol{\beta}$ plus a strictly positive term, which is $\sigma$ times the inverse Mills ratio evaluated at $\mathbf{x}\boldsymbol{\beta}/\sigma$. This
equation also shows why using OLS only for observations where $y_i > 0$ will not always consistently
estimate $\boldsymbol{\beta}$; essentially, the inverse Mills ratio is an omitted variable, and it is generally correlated
with the elements of x. 
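
Equation (17.24) can be checked by simulation. The sketch below draws from the latent model with a fixed value of $\mathbf{x}\boldsymbol{\beta}$, keeps the positive draws, and compares their average with the formula. The values of `xb` and `sigma`, the seed, and the number of draws are arbitrary choices made for illustration.

```python
import random
from statistics import NormalDist, fmean

N = NormalDist()


def inverse_mills(c):
    """lambda(c) = phi(c) / Phi(c)."""
    return N.pdf(c) / N.cdf(c)


def cond_mean(xb, sigma):
    """E(y | y > 0, x) from (17.24)."""
    return xb + sigma * inverse_mills(xb / sigma)


random.seed(0)
xb, sigma = 0.5, 2.0   # hypothetical index value and error standard deviation

# Latent draws y* = xb + u with u ~ Normal(0, sigma^2); keep those with y* > 0.
latent = [xb + random.gauss(0, sigma) for _ in range(200_000)]
positive = [v for v in latent if v > 0]

simulated = fmean(positive)
formula = cond_mean(xb, sigma)
print(f"simulated: {simulated:.3f}, formula: {formula:.3f}")
```

Both numbers exceed `xb` itself, which illustrates why a regression that uses only the positive observations and ignores the inverse Mills ratio term is misspecified.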

Combining (17.23) and (17.24) gives 


$$E(y|\mathbf{x}) = \Phi(\mathbf{x}\boldsymbol{\beta}/\sigma)[\mathbf{x}\boldsymbol{\beta} + \sigma\lambda(\mathbf{x}\boldsymbol{\beta}/\sigma)] = \Phi(\mathbf{x}\boldsymbol{\beta}/\sigma)\mathbf{x}\boldsymbol{\beta} + \sigma\phi(\mathbf{x}\boldsymbol{\beta}/\sigma),$$ [17.25]


where the second equality follows because $\Phi(\mathbf{x}\boldsymbol{\beta}/\sigma)\lambda(\mathbf{x}\boldsymbol{\beta}/\sigma) = \phi(\mathbf{x}\boldsymbol{\beta}/\sigma)$. This equation shows that
when $y$ follows a Tobit model, $E(y|\mathbf{x})$ is a nonlinear function of $\mathbf{x}$ and $\boldsymbol{\beta}$. Although it is not obvious,
the right-hand side of equation (17.25) can be shown to be positive for any values of $\mathbf{x}$ and $\boldsymbol{\beta}$.
Therefore, once we have estimates of $\boldsymbol{\beta}$, we can be sure that predicted values for $y$ (that is, estimates
of $E(y|\mathbf{x})$) are positive. The cost of ensuring positive predictions for $y$ is that equation (17.25) is more
complicated than a linear model for $E(y|\mathbf{x})$. Even more importantly, the partial effects from (17.25)
are more complicated than for a linear model. As we will see, the partial effects of $x_j$ on $E(y|y > 0, \mathbf{x})$
and $E(y|\mathbf{x})$ have the same sign as the coefficient, $\beta_j$, but the magnitude of the effects depends on the
values of all explanatory variables and parameters. Because $\sigma$ appears in (17.25), it is not surprising
that the partial effects depend on $\sigma$, too.
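
If $x_j$ is a continuous variable, we can find the partial effects using calculus. First,

$$\partial E(y|y > 0, \mathbf{x})/\partial x_j = \beta_j + \beta_j \cdot \frac{d\lambda}{dc}(\mathbf{x}\boldsymbol{\beta}/\sigma),$$

assuming that $x_j$ is not functionally related to other regressors. By differentiating $\lambda(c) = \phi(c)/\Phi(c)$
and using $d\Phi/dc = \phi(c)$ and $d\phi/dc = -c\phi(c)$, it can be shown that $d\lambda/dc = -\lambda(c)[c + \lambda(c)]$.
Therefore,

$$\partial E(y|y > 0, \mathbf{x})/\partial x_j = \beta_j\{1 - \lambda(\mathbf{x}\boldsymbol{\beta}/\sigma)[\mathbf{x}\boldsymbol{\beta}/\sigma + \lambda(\mathbf{x}\boldsymbol{\beta}/\sigma)]\}.$$ [17.26]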

This shows that the partial effect of $x_j$ on $E(y|y > 0, \mathbf{x})$ is not determined just by $\beta_j$. The adjustment
factor is given by the term in brackets, $\{\cdot\}$, and depends on a linear function of $\mathbf{x}$,
$\mathbf{x}\boldsymbol{\beta}/\sigma = (\beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k)/\sigma$. It can be shown that the adjustment factor is strictly between
zero and one. In practice, we can estimate (17.26) by plugging in the MLEs of the $\beta_j$ and $\sigma$. As with
logit and probit models, we must plug in values for the $x_j$, usually the mean values or other interesting
values.
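
Two claims in this derivation lend themselves to a quick numerical check: that $d\lambda/dc = -\lambda(c)[c + \lambda(c)]$, and that the adjustment factor in (17.26) lies strictly between zero and one. The sketch below checks both on a grid of values of $c = \mathbf{x}\boldsymbol{\beta}/\sigma$; the grid and step size are arbitrary, and a numerical check of this kind is of course not a proof.

```python
from statistics import NormalDist

N = NormalDist()


def inverse_mills(c):
    """lambda(c) = phi(c) / Phi(c)."""
    return N.pdf(c) / N.cdf(c)


def adjustment_factor(c):
    """The term in braces in (17.26): 1 - lambda(c) * [c + lambda(c)]."""
    lam = inverse_mills(c)
    return 1 - lam * (c + lam)


def dlambda_numeric(c, h=1e-5):
    """Central finite-difference approximation to d lambda / dc."""
    return (inverse_mills(c + h) - inverse_mills(c - h)) / (2 * h)


grid = [k / 2 for k in range(-6, 7)]   # c = -3.0, -2.5, ..., 3.0
factors = [adjustment_factor(c) for c in grid]

for c, f in zip(grid, factors):
    analytic = -inverse_mills(c) * (c + inverse_mills(c))
    print(f"c = {c:5.1f}: factor = {f:.4f}, "
          f"dlambda/dc analytic = {analytic:.5f}, numeric = {dlambda_numeric(c):.5f}")
```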

Equation (17.26) reveals a subtle point that is sometimes lost in applying the Tobit model to corner
solution responses: the parameter $\sigma$ appears directly in the partial effects, so having an estimate of
$\sigma$ is crucial for estimating the partial effects. Sometimes, $\sigma$ is called an “ancillary” parameter (which
means it is auxiliary, or unimportant). Although it is true that the value of $\sigma$ does not affect the sign of
the partial effects, it does affect the magnitudes, and we are often interested in the economic importance
of the explanatory variables. Therefore, characterizing $\sigma$ as ancillary is misleading and comes
from a confusion between the Tobit model for corner solution applications and applications to true 
data censoring. (For the latter, see Section 17-4.) 

All of the usual economic quantities, such as elasticities, can be computed. For example, the elasticity
of $y$ with respect to $x_1$, conditional on $y > 0$, is

$$\frac{\partial E(y|y > 0, \mathbf{x})}{\partial x_1} \cdot \frac{x_1}{E(y|y > 0, \mathbf{x})}.$$ [17.27]

This can be computed when $x_1$ appears in various functional forms, including level, logarithmic, and
quadratic forms.

If $x_1$ is a binary variable, the effect of interest is obtained as the difference between $E(y|y > 0, \mathbf{x})$,
with $x_1 = 1$ and $x_1 = 0$. Partial effects involving other discrete variables (such as number of children)
can be handled similarly. 

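We can use (17.25) to find the partial derivative of $E(y|\mathbf{x})$ with respect to continuous $x_j$. This
derivative accounts for the fact that people starting at $y = 0$ might choose $y > 0$ when $x_j$ changes:

$$\frac{\partial E(y|\mathbf{x})}{\partial x_j} = \frac{\partial P(y > 0|\mathbf{x})}{\partial x_j} \cdot E(y|y > 0, \mathbf{x}) + P(y > 0|\mathbf{x}) \cdot \frac{\partial E(y|y > 0, \mathbf{x})}{\partial x_j}.$$ [17.28]

Because $P(y > 0|\mathbf{x}) = \Phi(\mathbf{x}\boldsymbol{\beta}/\sigma)$,

$$\frac{\partial P(y > 0|\mathbf{x})}{\partial x_j} = (\beta_j/\sigma)\phi(\mathbf{x}\boldsymbol{\beta}/\sigma),$$ [17.29]

so we can estimate each term in (17.28), once we plug in the MLEs of the $\beta_j$ and $\sigma$ and particular
values of the $x_j$.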

Remarkably, when we plug (17.26) and (17.29) into (17.28) and use the fact that $\Phi(c)\lambda(c) = \phi(c)$
for any $c$, we obtain

$$\frac{\partial E(y|\mathbf{x})}{\partial x_j} = \beta_j\Phi(\mathbf{x}\boldsymbol{\beta}/\sigma).$$ [17.30]

Equation (17.30) allows us to roughly compare OLS and Tobit estimates. [Equation (17.30) also can
be derived directly from equation (17.25) using the fact that $d\phi(z)/dz = -z\phi(z)$.] The OLS slope
coefficients, say, $\hat{\gamma}_j$, from the regression of $y_i$ on $x_{i1}, x_{i2}, \ldots, x_{ik}$, $i = 1, \ldots, n$ (that is, using all of the
data) are direct estimates of $\partial E(y|\mathbf{x})/\partial x_j$. To make the Tobit coefficient, $\hat{\beta}_j$, comparable to $\hat{\gamma}_j$, we must
multiply $\hat{\beta}_j$ by an adjustment factor.
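
The simplification in (17.30) can be confirmed numerically by differentiating (17.25) with a finite difference. In the sketch below the parameter values and the evaluation point are hypothetical, the first element of `x` is the intercept, and the derivative is taken with respect to the second element.

```python
from statistics import NormalDist

N = NormalDist()


def tobit_mean(x, beta, sigma):
    """E(y | x) from (17.25): Phi(xb/sigma) * xb + sigma * phi(xb/sigma)."""
    xb = sum(b * v for b, v in zip(beta, x))
    return N.cdf(xb / sigma) * xb + sigma * N.pdf(xb / sigma)


# Hypothetical parameter values and evaluation point; x[0] = 1 is the intercept.
beta, sigma = (0.2, 0.8), 1.5
x = (1.0, 1.5)
j, h = 1, 1e-6

# Central finite difference of E(y | x) with respect to x_j.
x_up = tuple(v + h if k == j else v for k, v in enumerate(x))
x_dn = tuple(v - h if k == j else v for k, v in enumerate(x))
numeric = (tobit_mean(x_up, beta, sigma) - tobit_mean(x_dn, beta, sigma)) / (2 * h)

# Closed form from (17.30): beta_j * Phi(xb / sigma).
xb = sum(b * v for b, v in zip(beta, x))
analytic = beta[j] * N.cdf(xb / sigma)

print(f"numeric: {numeric:.6f}, analytic: {analytic:.6f}")
```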

As in the probit and logit cases, there are two common approaches for computing an adjustment 
factor for obtaining partial effects—at least for continuous explanatory variables. Both are based on 
equation (17.30). First, the PEA is obtained by evaluating $\Phi(\mathbf{x}\hat{\boldsymbol{\beta}}/\hat{\sigma})$ at the sample averages, which we denote $\Phi(\bar{\mathbf{x}}\hat{\boldsymbol{\beta}}/\hat{\sigma})$. We
can then use this single factor to multiply the coefficients on the continuous explanatory variables.
The PEA has the same drawbacks here as in the probit and logit cases: we may not be interested in 
the partial effect for the “average” because the average is either uninteresting or meaningless. Plus, 
we must decide whether to use averages of nonlinear functions or plug the averages into the nonlinear 
functions. 

The average partial effect, APE, is preferred in most cases. Here, we compute the scale factor
as $n^{-1}\sum_{i=1}^{n}\Phi(\mathbf{x}_i\hat{\boldsymbol{\beta}}/\hat{\sigma})$. Unlike the PEA, the APE does not require us to plug in a fictitious or
nonexistent unit from the population, and there are no decisions to make about plugging averages into
nonlinear functions. Like the PEA, the APE scale factor is always between zero and one because
$0 < \Phi(\mathbf{x}\hat{\boldsymbol{\beta}}/\hat{\sigma}) < 1$ for any values of the explanatory variables. In fact, $\hat{P}(y_i > 0|\mathbf{x}_i) = \Phi(\mathbf{x}_i\hat{\boldsymbol{\beta}}/\hat{\sigma})$, and
so the APE scale factor and the PEA scale factor tend to be closer to one when there are few observations
with $y_i = 0$. In the case that $y_i > 0$ for all $i$, the Tobit and OLS estimates of the parameters are
identical. [Of course, if $y_i > 0$ for all $i$, we cannot justify the use of a Tobit model anyway. Using
$\log(y_i)$ in a linear regression model makes much more sense.]

Unfortunately, for discrete explanatory variables, comparing OLS and Tobit estimates is not so 
easy (although using the scale factor for continuous explanatory variables often is a useful approxi- 
mation). For Tobit, the partial effect of a discrete explanatory variable, for example, a binary variable, 
should really be obtained by estimating $E(y|\mathbf{x})$ from equation (17.25). For example, if $x_1$ is a binary variable,
we should first plug in $x_1 = 1$ and then $x_1 = 0$. If we set the other explanatory variables at their sample
averages, we obtain a measure analogous to (17.16) for the logit and probit cases. If we compute
the difference in expected values for each individual, and then average the difference, we get an APE
analogous to (17.17). Fortunately, many modern statistical packages routinely compute the APEs
for fairly complicated models, including the Tobit model, and allow both continuous and discrete 
explanatory variables. 
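
To make the two scale factors concrete, here is a minimal Python sketch that computes both from Tobit estimates already in hand. The function name `tobit_scale_factors`, the toy design matrix, and the coefficient values are hypothetical and chosen only for illustration; in practice a statistical package would report these quantities.

```python
from statistics import NormalDist, fmean

def tobit_scale_factors(X, beta_hat, sigma_hat):
    """Return (APE factor, PEA factor) for continuous regressors.

    X        : list of rows; each row starts with 1.0 for the intercept
    beta_hat : Tobit coefficient estimates, same length as a row
    sigma_hat: Tobit estimate of sigma
    """
    Phi = NormalDist().cdf

    def index(row):
        # x * beta_hat for one row of explanatory variables
        return sum(b * v for b, v in zip(beta_hat, row))

    # APE factor: average of Phi(x_i * beta_hat / sigma_hat) over the sample.
    ape = fmean([Phi(index(row) / sigma_hat) for row in X])
    # PEA factor: Phi evaluated at the column averages of X.
    x_bar = [fmean(col) for col in zip(*X)]
    pea = Phi(index(x_bar) / sigma_hat)
    return ape, pea

# Hypothetical toy sample: an intercept and one continuous regressor.
X = [[1.0, 8.0], [1.0, 12.0], [1.0, 16.0]]
beta_hat = [-694.12, 80.65]   # illustrative values only
sigma_hat = 1122.02           # illustrative value only

ape, pea = tobit_scale_factors(X, beta_hat, sigma_hat)
print(f"APE factor = {ape:.3f}, PEA factor = {pea:.3f}")
print(f"APE of regressor = {ape * beta_hat[1]:.2f}")
```

Both factors lie strictly between zero and one, so each scaled Tobit coefficient is smaller in magnitude than the raw coefficient.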


Married Women’s Annual Labor Supply 


The file MROZ includes data on hours worked for 753 married women, 428 of whom worked for a 
wage outside the home during the year; 325 of the women worked zero hours. For the women who 
worked positive hours, the range is fairly broad, extending from 12 to 4,950. Thus, annual hours 
worked is a good candidate for a Tobit model. We also estimate a linear model (using all 753 obser- 
vations) by OLS, and compute the heteroskedasticity-robust standard errors. The results are given in 
Table 17.3. 

This table has several noteworthy features. First, the Tobit coefficient estimates have the same 
sign as the corresponding OLS estimates, and the statistical significance of the estimates is similar. 
(Possible exceptions are the coefficients on nwifeinc and kidsge6, but the t statistics have similar mag- 
nitudes.) Second, though it is tempting to compare the magnitudes of the OLS and Tobit estimates, 
this is not very informative. We must be careful not to think that, because the Tobit coefficient on 
kidslt6 is roughly twice that of the OLS coefficient, the Tobit model implies a much greater response
of hours worked to young children. 

We can multiply the Tobit estimates by appropriate adjustment factors to make them roughly
comparable to the OLS estimates. The APE scale factor $n^{-1}\sum_{i=1}^{n}\Phi(\mathbf{x}_i\hat{\boldsymbol{\beta}}/\hat{\sigma})$ turns out to be about
.589, which we can use to obtain the average partial effects for the Tobit estimation. If, for example,
we multiply the educ coefficient by .589 we get $.589(80.65) \approx 47.50$ (that is, 47.5 hours more),
which is quite a bit larger than the OLS partial effect, about 28.8 hours. Table 17.4 contains the
APEs for all variables, where the APEs for the linear model are simply the OLS coefficients except
for the variable exper, which appears as a quadratic. The APEs and their standard errors, obtained
from Stata® 13, are rounded to two decimal places and because of rounding can differ slightly from
what is obtained by multiplying .589 by the reported Tobit coefficient. The Tobit APEs for nwifeinc,
educ, and kidslt6 are all substantially larger in magnitude than the corresponding OLS coefficients.
The APEs for exper and age are similar, and for kidsge6, which is nowhere close to being statistically
significant, the Tobit APE is smaller in magnitude.


TABLE 17.3 OLS and Tobit Estimation of Annual Hours Worked

Dependent Variable: hours

Independent Variables    Linear (OLS)    Tobit (MLE)
nwifeinc                 −3.45           −8.81
                         (2.24)          (4.46)
educ                     28.76           80.65
                         (13.04)         (21.58)
exper                    65.67           131.56
                         (10.79)         (17.28)
exper²                   −.700           −1.86
                         (.372)          (0.54)
age                      −30.51          −54.41
                         (4.24)          (7.42)
kidslt6                  −442.09         −894.02
                         (57.46)         (111.88)
kidsge6                  −32.78          −16.22
                         (22.80)         (38.64)
constant                 1,330.48        965.31
                         (274.88)        (446.44)
Log-likelihood value                     −3,819.09
R-squared                .266            .274
σ̂                        750.18          1,122.02


If, instead, we want the estimated effect of another year of education starting at the average
values of all explanatory variables, then we compute the PEA scale factor $\Phi(\bar{\mathbf{x}}\hat{\boldsymbol{\beta}}/\hat{\sigma})$. This turns out
to be about .645 [when we use the squared average of experience, $(\overline{exper})^2$, rather than the average
of $exper^2$]. This partial effect, which is about 52 hours, is almost twice as large as the OLS estimate.

TABLE 17.4 Average Partial Effects for the Hours Worked Models

Independent Variables    Linear          Tobit
nwifeinc                 −3.45           −5.19
                         (2.24)          (2.62)
educ                     28.76           47.47
                         (13.04)         (12.62)
exper                    50.78           48.79
                         (4.45)          (3.59)
age                      −30.51          −32.03
                         (4.24)          (4.29)
kidslt6                  −442.09         −526.28
                         (57.46)         (64.71)
kidsge6                  −32.78          −9.55
                         (22.80)         (22.75)


We have reported an R-squared for both the linear regression and the Tobit models. The R-squared
for OLS is the usual one. For Tobit, the R-squared is the square of the correlation coefficient between
$y_i$ and $\hat{y}_i$, where $\hat{y}_i = \Phi(\mathbf{x}_i\hat{\boldsymbol{\beta}}/\hat{\sigma})\mathbf{x}_i\hat{\boldsymbol{\beta}} + \hat{\sigma}\phi(\mathbf{x}_i\hat{\boldsymbol{\beta}}/\hat{\sigma})$ is the estimate of $E(y|\mathbf{x} = \mathbf{x}_i)$. This is motivated by
the fact that the usual R-squared for OLS is equal to the squared correlation between the $y_i$ and the
fitted values [see equation (3.29)]. In nonlinear models such as the Tobit model, the squared correlation
coefficient is not identical to an R-squared based on a sum of squared residuals as in (3.28). This
is because the fitted values, as defined earlier, and the residuals, $y_i - \hat{y}_i$, are not uncorrelated in the
sample. An R-squared defined as the squared correlation coefficient between $y_i$ and $\hat{y}_i$ has the advantage
of always being between zero and one; an R-squared based on a sum of squared residuals need
not have this feature.
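
The squared-correlation R-squared is simple to compute once the Tobit index values are available. The following sketch uses made-up outcomes and index values; the names `tobit_fitted` and `squared_correlation` and all numbers are hypothetical, not taken from the MROZ estimation.

```python
from statistics import NormalDist, fmean

_N = NormalDist()  # standard normal: .cdf is Phi, .pdf is phi

def tobit_fitted(xb, sigma):
    """Estimate of E(y|x): Phi(xb/sigma)*xb + sigma*phi(xb/sigma)."""
    z = xb / sigma
    return _N.cdf(z) * xb + sigma * _N.pdf(z)

def squared_correlation(y, y_hat):
    """Square of the sample correlation between y and y_hat."""
    my, mf = fmean(y), fmean(y_hat)
    cov = sum((a - my) * (b - mf) for a, b in zip(y, y_hat))
    var_y = sum((a - my) ** 2 for a in y)
    var_f = sum((b - mf) ** 2 for b in y_hat)
    return cov * cov / (var_y * var_f)

# Hypothetical data: outcomes with a pile-up at zero, and index values x_i*beta_hat.
y = [0.0, 0.0, 500.0, 1200.0, 2000.0]
xb = [-400.0, -100.0, 300.0, 900.0, 1500.0]
sigma_hat = 800.0

fitted = [tobit_fitted(v, sigma_hat) for v in xb]
r2 = squared_correlation(y, fitted)
print([round(f, 1) for f in fitted], round(r2, 3))
```

Because the measure is a squared correlation, it cannot fall outside the unit interval, and every fitted value is positive even when the index $\mathbf{x}_i\hat{\boldsymbol{\beta}}$ is negative.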

We can see that, based on the R-squared measures, the Tobit conditional mean function fits the 
hours data somewhat, but not substantially, better. However, we should remember that the Tobit esti- 
mates are not chosen to maximize an R-squared—they maximize the log-likelihood function—whereas 
the OLS estimates are the values that do produce the highest R-squared given the linear functional form. 

By construction, all of the Tobit fitted values for hours are positive. By contrast, 39 of the OLS fit- 
ted values are negative. Although negative predictions are of some concern, 39 out of 753 is just over 
5% of the observations. It is not entirely clear how negative fitted values for OLS translate into differences
in estimated partial effects. Figure 17.3 plots estimates of $E(hours|\mathbf{x})$ as a function of education;
for the Tobit model, the other explanatory variables are set at their average values. For the linear
model, the equation graphed is $\widehat{hours} = 387.19 + 28.76\,educ$. For the Tobit model, the equation graphed
is $\widehat{hours} = \Phi[(-694.12 + 80.65\,educ)/1{,}122.02]\cdot(-694.12 + 80.65\,educ) + 1{,}122.02\cdot\phi[(-694.12 + 80.65\,educ)/1{,}122.02]$.
As can be seen from the figure, the linear model gives notably
higher estimates of the expected hours worked at even fairly high levels of education. For example, 
at eight years of education, the OLS predicted value of hours is about 617.5, while the Tobit estimate 
is about 423.9. At 12 years of education, the predicted hours are about 732.7 and 598.3, respectively. 
The two prediction lines cross after 17 years of education, but no woman in the sample has more than 
17 years of education. The increasing slope of the Tobit line clearly indicates the increasing marginal 
effect of education on expected hours worked. 
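
The two graphed equations can be evaluated directly. This is a small sketch using only the rounded coefficients printed above, so the results agree with the quoted predictions only up to rounding; the function names `ols_hours` and `tobit_hours` are ours.

```python
from statistics import NormalDist

_N = NormalDist()  # standard normal: .cdf is Phi, .pdf is phi

def ols_hours(educ):
    """Linear prediction with other regressors at their averages."""
    return 387.19 + 28.76 * educ

def tobit_hours(educ, sigma=1122.02):
    """Tobit estimate of E(hours|x): Phi(z)*xb + sigma*phi(z), z = xb/sigma."""
    xb = -694.12 + 80.65 * educ
    z = xb / sigma
    return _N.cdf(z) * xb + sigma * _N.pdf(z)

for educ in (8, 12, 17, 18):
    print(educ, round(ols_hours(educ), 1), round(tobit_hours(educ), 1))
```

Evaluating at 17 and 18 years shows the Tobit curve overtaking the OLS line between those two values, and successive one-year differences in `tobit_hours` grow with education, which is the increasing marginal effect described in the text.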


FIGURE 17.3 Estimated expected values of hours with respect to education for the linear and Tobit models.


17-2b Specification Issues in Tobit Models


The Tobit model, and in particular the formulas for the expectations in (17.24) and (17.25), rely 
crucially on normality and homoskedasticity in the underlying latent variable model. When 
$E(y|\mathbf{x}) = \beta_0 + \beta_1x_1 + \cdots + \beta_kx_k$, we know from Chapter 5 that conditional normality of $y$ does not
play a role in unbiasedness, consistency, or large sample inference. Heteroskedasticity does not affect 
unbiasedness or consistency of OLS, although we must compute robust standard errors and test sta- 
tistics to perform approximate inference. In a Tobit model, if any of the assumptions in (17.18) fail, 
then it is hard to know what the Tobit MLE is estimating. Nevertheless, for moderate departures from 
the assumptions, the Tobit model is likely to provide good estimates of the partial effects on the con- 
ditional means. It is possible to allow for more general assumptions in (17.18), but such models are 
much more complicated to estimate and interpret. 

One potentially important limitation of the Tobit model, at least in certain applications, is that 
the expected value conditional on y > 0 is closely linked to the probability that y > 0. This is clear 
from equations (17.26) and (17.29). In particular, the effect of $x_j$ on $P(y > 0|\mathbf{x})$ is proportional to $\beta_j$,
as is the effect on $E(y|y > 0, \mathbf{x})$, where both functions multiplying $\beta_j$ are positive and depend on
$\mathbf{x}$ only through $\mathbf{x}\boldsymbol{\beta}/\sigma$. This rules out some interesting possibilities. For example, consider the relationship
between amount of life insurance coverage and a person’s age. Young people may be less likely
to have life insurance at all, so the probability that $y > 0$ increases with age (at least up to a point).
Conditional on having life insurance, the value of policies might decrease with age, since life insur- 
ance becomes less important as people near the end of their lives. This possibility is not allowed for 
in the Tobit model. 

One way to informally evaluate whether the Tobit model is appropriate is to estimate a probit
model where the binary outcome, say, $w$, equals one if $y > 0$, and $w = 0$ if $y = 0$. Then, from
(17.21), $w$ follows a probit model, where the coefficient on $x_j$ is $\gamma_j = \beta_j/\sigma$. This means we can estimate
the ratio of $\beta_j$ to $\sigma$ by probit, for each $j$. If the Tobit model holds, the probit estimate, $\hat{\gamma}_j$, should
be “close” to $\hat{\beta}_j/\hat{\sigma}$, where $\hat{\beta}_j$ and $\hat{\sigma}$ are the Tobit estimates. These will never be identical because of
sampling error. But we can look for certain problematic signs. For example, if $\hat{\gamma}_j$ is significant and
negative, but $\hat{\beta}_j$ is positive, the Tobit model might not be appropriate. Or, if $\hat{\gamma}_j$ and $\hat{\beta}_j$ are the same
sign, but $|\hat{\beta}_j/\hat{\sigma}|$ is much larger or smaller than $|\hat{\gamma}_j|$, this could also indicate problems. We should not
worry too much about sign changes or magnitude differences on explanatory variables that are insignificant
in both models.

In the annual hours worked example, $\hat{\sigma} = 1{,}122.02$. When we divide the Tobit coefficient on nwifeinc
by $\hat{\sigma}$, we obtain $-8.81/1{,}122.02 \approx -.0079$; the probit coefficient on nwifeinc is about $-.012$,
which is different, but not dramatically so. On kidslt6, the coefficient estimate over $\hat{\sigma}$ is about $-.797$,
compared with the probit estimate of —.868. Again, this is not a huge difference, but it indicates that 
having small children has a larger effect on the initial labor force participation decision than on how 
many hours a woman chooses to work once she is in the labor force. (Tobit effectively averages these 
two effects together.) We do not know whether the effects are statistically different, but they are of the 
same order of magnitude. 
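
The informal check amounts to a few divisions. A short sketch using the estimates quoted in this example follows; the helper name `implied_probit_coef` is ours, and the comparison is descriptive only, not a formal test.

```python
sigma_hat = 1122.02

# Tobit coefficients (Table 17.3) and the probit estimates quoted in the text.
tobit = {"nwifeinc": -8.81, "kidslt6": -894.02}
probit = {"nwifeinc": -0.012, "kidslt6": -0.868}

def implied_probit_coef(beta_hat, sigma_hat):
    """Under the Tobit model, the probit coefficient should be near beta/sigma."""
    return beta_hat / sigma_hat

for name, beta_hat in tobit.items():
    ratio = implied_probit_coef(beta_hat, sigma_hat)
    same_sign = ratio * probit[name] > 0
    print(f"{name}: Tobit/sigma = {ratio:.4f}, probit = {probit[name]:.3f}, "
          f"same sign: {same_sign}")
```

Both ratios share the sign of the corresponding probit estimate, and in each case the probit estimate is somewhat larger in magnitude.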

What happens if we conclude that the Tobit model is inappropriate? There are models, usually 
called hurdle or two-part models, that can be used when Tobit seems unsuitable. These all have the 
property that $P(y > 0|\mathbf{x})$ and $E(y|y > 0, \mathbf{x})$ depend on different parameters, so $x_j$ can have dissimilar
effects on these two functions. [See Wooldridge (2010, Chapter 17) for a description of these models.]


17-3 The Poisson Regression Model 


Another kind of nonnegative dependent variable is a count variable, which can take on nonnegative 
integer values: {0, 1, 2, . . .}. We are especially interested in cases where y takes on relatively few 
values, including zero. Examples include the number of children ever born to a woman, the number
of times someone is arrested in a year, or the number of patents applied for by a firm in a year.
For the same reasons discussed for binary and Tobit responses, a linear model for $E(y|x_1, \ldots, x_k)$
might not provide the best fit over all values of the explanatory variables. (Nevertheless, it is always
informative to start with a linear model, as we did in Example 3.5.)

As with a Tobit outcome, we cannot take the logarithm of a count variable because it takes on the 
value zero. A profitable approach is to model the expected value as an exponential function: 


$$E(y|x_1, x_2, \ldots, x_k) = \exp(\beta_0 + \beta_1x_1 + \cdots + \beta_kx_k).$$ [17.31]


Because $\exp(\cdot)$ is always positive, (17.31) ensures that predicted values for $y$ will also be positive.
The exponential function is graphed in Figure A.5 of Math Refresher A. 

Although (17.31) is more complicated than a linear model, we basically already know how to 
interpret the coefficients. Taking the log of equation (17.31) shows that 


$$\log[E(y|x_1, x_2, \ldots, x_k)] = \beta_0 + \beta_1x_1 + \cdots + \beta_kx_k,$$ [17.32]


so that the log of the expected value is linear. Therefore, using the approximation properties of the log 
function that we have used often in previous chapters, 


$$\%\Delta E(y|\mathbf{x}) \approx (100\beta_j)\Delta x_j.$$


In other words, $100\beta_j$ is roughly the percentage change in $E(y|\mathbf{x})$, given a one-unit increase in $x_j$.
Sometimes, a more accurate estimate is needed, and we can easily find one by looking at discrete
changes in the expected value. Keep all explanatory variables except $x_k$ fixed and let $x_k^0$ be the initial
value and $x_k^1$ the subsequent value. Then, the proportionate change in the expected value is


$$[\exp(\beta_0 + \mathbf{x}_{k-1}\boldsymbol{\beta}_{k-1} + \beta_kx_k^1)/\exp(\beta_0 + \mathbf{x}_{k-1}\boldsymbol{\beta}_{k-1} + \beta_kx_k^0)] - 1 = \exp(\beta_k\Delta x_k) - 1,$$


where $\mathbf{x}_{k-1}\boldsymbol{\beta}_{k-1}$ is shorthand for $\beta_1x_1 + \cdots + \beta_{k-1}x_{k-1}$ and $\Delta x_k = x_k^1 - x_k^0$. When $\Delta x_k = 1$ (for
example, if $x_k$ is a dummy variable that we change from zero to one), the change is $\exp(\beta_k) - 1$.
Given $\hat{\beta}_k$, we obtain $\exp(\hat{\beta}_k) - 1$ and multiply this by 100 to turn the proportionate change into a
percentage change.

If, say, $x_j = \log(z_j)$ for some variable $z_j > 0$, then its coefficient, $\beta_j$, is interpreted as an elasticity
with respect to $z_j$. Technically, it is an elasticity of the expected value of $y$ with respect to $z_j$ because
we cannot compute the percentage change in cases where $y = 0$. For our purposes, the distinction is
unimportant. The bottom line is that, for practical purposes, we can interpret the coefficients in equa- 
tion (17.31) as if we have a linear model, with log(y) as the dependent variable. There are some subtle 
differences that we need not study here. 

Because (17.31) is nonlinear in its parameters (remember, $\exp(\cdot)$ is a nonlinear function), we
cannot use linear regression methods. We could use nonlinear least squares, which, just as with OLS, 
minimizes the sum of squared residuals. It turns out, however, that all standard count data distribu- 
tions exhibit heteroskedasticity, and nonlinear least squares does not exploit this [see Wooldridge 
(2010, Chapter 12)]. Instead, we will rely on maximum likelihood and the important related method 
of quasi-maximum likelihood estimation. 

In Chapter 4, we introduced normality as the standard distributional assumption for linear regres- 
sion. The normality assumption is reasonable for (roughly) continuous dependent variables that can 
take on a large range of values. A count variable cannot have a normal distribution (because the nor- 
mal distribution is for continuous variables that can take on all values), and if it takes on very few 
values, the distribution can be very different from normal. Instead, the nominal distribution for count 
data is the Poisson distribution. 

Because we are interested in the effect of explanatory variables on y, we must look at the Poisson 
distribution conditional on x. The Poisson distribution is entirely determined by its mean, so we only 
need to specify E(y|x). We assume this has the same form as (17.31), which we write in shorthand as 
$\exp(\mathbf{x}\boldsymbol{\beta})$. Then, the probability that $y$ equals the value $h$, conditional on $\mathbf{x}$, is


$$P(y = h|\mathbf{x}) = \exp[-\exp(\mathbf{x}\boldsymbol{\beta})][\exp(\mathbf{x}\boldsymbol{\beta})]^h/h!, \quad h = 0, 1, \ldots,$$

where $h!$ denotes factorial (see Math Refresher B). This distribution, which is the basis for the Poisson
regression model, allows us to find conditional probabilities for any values of the explanatory variables.
For example, $P(y = 0|\mathbf{x}) = \exp[-\exp(\mathbf{x}\boldsymbol{\beta})]$. Once we have estimates of the $\beta_j$, we can plug
them into the probabilities for various values of $\mathbf{x}$.
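
As a quick illustration, the conditional probabilities can be computed with a few lines of Python. The function name `poisson_prob` and the index value used below are hypothetical; any estimate of $\mathbf{x}\boldsymbol{\beta}$ could be plugged in.

```python
import math

def poisson_prob(h, xb):
    """P(y = h | x) when E(y|x) = exp(xb), for h = 0, 1, 2, ..."""
    mean = math.exp(xb)
    return math.exp(-mean) * mean ** h / math.factorial(h)

xb = 0.3  # hypothetical value of the index x*beta
for h in range(4):
    print(h, round(poisson_prob(h, xb), 4))
```

Setting `h = 0` reproduces the special case $\exp[-\exp(\mathbf{x}\boldsymbol{\beta})]$ noted above, and the probabilities sum to one over $h = 0, 1, 2, \ldots$.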

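Given a random sample $\{(\mathbf{x}_i, y_i): i = 1, 2, \ldots, n\}$, we can construct the log-likelihood function:


$$\mathcal{L}(\boldsymbol{\beta}) = \sum_{i=1}^{n}\ell_i(\boldsymbol{\beta}) = \sum_{i=1}^{n}\{y_i\mathbf{x}_i\boldsymbol{\beta} - \exp(\mathbf{x}_i\boldsymbol{\beta})\},$$ [17.33]


where we drop the term $-\log(y_i!)$ because it does not depend on $\boldsymbol{\beta}$. This log-likelihood function is
simple to maximize, although the Poisson MLEs are not obtained in closed form.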
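
Because there is no closed form, the maximization is done numerically. The sketch below uses Newton-Raphson iterations in plain Python on a made-up data set; it is one common way to maximize (17.33), not necessarily what any particular package does, and all function names and data values are hypothetical.

```python
import math

def linear_index(beta, row):
    return sum(b * v for b, v in zip(beta, row))

def poisson_loglik(beta, X, y):
    """Log-likelihood (17.33); the -log(y_i!) term is dropped."""
    return sum(yi * linear_index(beta, xi) - math.exp(linear_index(beta, xi))
               for xi, yi in zip(X, y))

def solve(A, b):
    """Solve the small linear system A z = b by Gauss-Jordan elimination."""
    n = len(b)
    M = [list(row) + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def fit_poisson(X, y, max_iter=100, tol=1e-10):
    """Newton-Raphson for the Poisson MLE; each row of X starts with 1.0."""
    k = len(X[0])
    # Start at intercept = log(mean of y), slopes = 0.
    beta = [math.log(sum(y) / len(y))] + [0.0] * (k - 1)
    for _ in range(max_iter):
        mu = [math.exp(linear_index(beta, xi)) for xi in X]
        # Score: sum of (y_i - mu_i) * x_ij for each coefficient j.
        score = [sum((yi - mi) * xi[j] for xi, yi, mi in zip(X, y, mu))
                 for j in range(k)]
        # Negative Hessian: sum of mu_i * x_ij * x_il.
        info = [[sum(mi * xi[j] * xi[l] for xi, mi in zip(X, mu))
                 for l in range(k)] for j in range(k)]
        step = solve(info, score)
        beta = [b + s for b, s in zip(beta, step)]
        if max(abs(s) for s in step) < tol:
            break
    return beta

# Hypothetical toy data: an intercept and one regressor.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0], [1.0, 5.0]]
y = [0, 1, 0, 2, 3, 5]
beta_hat = fit_poisson(X, y)
print(beta_hat, poisson_loglik(beta_hat, X, y))
```

The log-likelihood is globally concave in $\boldsymbol{\beta}$, which is why a simple Newton scheme of this kind usually converges in a handful of iterations.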

The standard errors of the Poisson estimates $\hat{\beta}_j$ are easy to obtain after the log-likelihood function
has been maximized; the formula is in Appendix 17B. These are reported along with the $\hat{\beta}_j$ by any
software package.

As with the probit, logit, and Tobit models, we cannot directly compare the magnitudes
of the Poisson estimates of an exponential function with the OLS estimates of a linear function.
Nevertheless, a rough comparison is possible, at least for continuous explanatory variables.
If (17.31) holds, then the partial effect of $x_j$ with respect to $E(y|x_1, x_2, \ldots, x_k)$ is
$\partial E(y|x_1, x_2, \ldots, x_k)/\partial x_j = \exp(\beta_0 + \beta_1x_1 + \cdots + \beta_kx_k)\cdot\beta_j$. This expression follows from the chain
rule in calculus because the derivative of the exponential function is just the exponential function. If
we let $\hat{\gamma}_j$ denote an OLS slope coefficient from the regression $y$ on $x_1, x_2, \ldots, x_k$, then we can roughly
compare the magnitude of the $\hat{\gamma}_j$ and the average partial effect for an exponential regression function.
Interestingly, the APE scale factor in this case, $n^{-1}\sum_{i=1}^{n}\exp(\hat{\beta}_0 + \hat{\beta}_1x_{i1} + \cdots + \hat{\beta}_kx_{ik}) = n^{-1}\sum_{i=1}^{n}\hat{y}_i$,
is simply the sample average $\bar{y}$ of $y_i$, where we define the fitted values as $\hat{y}_i = \exp(\hat{\beta}_0 + \mathbf{x}_i\hat{\boldsymbol{\beta}})$. In
other words, for Poisson regression with an exponential mean function, the average of the fitted values
is the same as the average of the original outcomes on $y_i$, just as in the linear regression case.
This makes it simple to scale the Poisson estimates, $\hat{\beta}_j$, to make them comparable to the corresponding
OLS estimates, $\hat{\gamma}_j$: for a continuous explanatory variable, we can compare $\hat{\gamma}_j$ to $\bar{y}\cdot\hat{\beta}_j$.
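
The equality between the average fitted value and $\bar{y}$ follows from the first-order condition for the intercept. A short derivation, assuming the model includes an intercept and using the log-likelihood in (17.33) with fitted values $\hat{y}_i = \exp(\hat{\beta}_0 + \mathbf{x}_i\hat{\boldsymbol{\beta}})$:

```latex
\begin{align*}
\frac{\partial \mathcal{L}(\boldsymbol{\beta})}{\partial \beta_0}\bigg|_{\hat{\boldsymbol{\beta}}}
  &= \sum_{i=1}^{n}\{y_i - \exp(\hat{\beta}_0 + \mathbf{x}_i\hat{\boldsymbol{\beta}})\}
   = \sum_{i=1}^{n}(y_i - \hat{y}_i) = 0
   \;\Longrightarrow\; n^{-1}\sum_{i=1}^{n}\hat{y}_i = \bar{y}, \\
\widehat{APE}_j
  &= n^{-1}\sum_{i=1}^{n}\hat{y}_i\,\hat{\beta}_j = \bar{y}\,\hat{\beta}_j .
\end{align*}
```

The same argument explains why OLS residuals sum to zero when the regression includes an intercept.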

Although Poisson MLE analysis is a natural first step for count data, it is often much too 
restrictive. All of the probabilities and higher moments of the Poisson distribution are determined 
entirely by the mean. In particular, the variance is equal to the mean: 


Var(y|x) = E(y|x). [17.34] 


This is restrictive and has been shown to be violated in many applications. Fortunately, the Poisson 
distribution has a very nice robustness property: whether or not the Poisson distribution holds, we 
still get consistent, asymptotically normal estimators of the $\beta_j$. [See Wooldridge (2010, Chapter 18)
for details.] This is analogous to the OLS estimator, which is consistent and asymptotically normal 
whether or not the normality assumption holds; yet OLS is the MLE under normality. 

When we use Poisson MLE, but we do not assume that the Poisson distribution is entirely cor- 
rect, we call the analysis quasi-maximum likelihood estimation (QMLE). The Poisson QMLE is 
very handy because it is programmed in many econometrics packages. However, unless the Poisson 
variance assumption (17.34) holds, the standard errors need to be adjusted. 

A simple adjustment to the standard errors is available when we assume that the variance is proportional
to the mean:


$$\text{Var}(y|\mathbf{x}) = \sigma^2E(y|\mathbf{x}),$$ [17.35]


where $\sigma^2 > 0$ is an unknown parameter. When $\sigma^2 = 1$, we obtain the Poisson variance assumption.
When $\sigma^2 > 1$, the variance is greater than the mean for all $\mathbf{x}$; this is called overdispersion because
the variance is larger than in the Poisson case, and it is observed in many applications of count regressions.
The case $\sigma^2 < 1$, called underdispersion, is less common but is allowed in (17.35).

Under (17.35), it is easy to adjust the usual Poisson MLE standard errors. Let $\hat{\beta}_j$ denote the Poisson
QMLE and define the residuals as $\hat{u}_i = y_i - \hat{y}_i$, where $\hat{y}_i = \exp(\hat{\beta}_0 + \hat{\beta}_1x_{i1} + \cdots + \hat{\beta}_kx_{ik})$ is the fitted
value. As usual, the residual for observation $i$ is the difference between $y_i$ and its fitted value. A consistent
estimator of $\sigma^2$ is $(n - k - 1)^{-1}\sum_{i=1}^{n}\hat{u}_i^2/\hat{y}_i$, where the division by $\hat{y}_i$ is the proper heteroskedasticity
adjustment and $n - k - 1$ is the df given $n$ observations and $k + 1$ estimates $\hat{\beta}_0, \hat{\beta}_1, \ldots, \hat{\beta}_k$.
Letting $\hat{\sigma}$ be the positive square root of $\hat{\sigma}^2$, we multiply the usual Poisson standard errors by $\hat{\sigma}$. If $\hat{\sigma}$
is notably greater than one, the corrected standard errors can be much bigger than the nominal, generally
incorrect, Poisson MLE standard errors.
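
The estimator of $\sigma^2$ and the standard error correction are easy to code. In the sketch below, the function names and the toy outcomes and fitted values are hypothetical; in practice the fitted values would come from the Poisson QMLE itself.

```python
import math

def dispersion_estimate(y, y_hat, k):
    """sigma^2 hat = (n - k - 1)^(-1) * sum(u_i^2 / yhat_i), u_i = y_i - yhat_i."""
    n = len(y)
    weighted_ssr = sum((yi - fi) ** 2 / fi for yi, fi in zip(y, y_hat))
    return weighted_ssr / (n - k - 1)

def adjust_standard_errors(std_errors, sigma2_hat):
    """Multiply the usual Poisson MLE standard errors by sigma hat."""
    sigma_hat = math.sqrt(sigma2_hat)
    return [sigma_hat * se for se in std_errors]

# Hypothetical outcomes and fitted values for a model with k = 1 slope.
y = [0, 1, 0, 3, 2, 6]
y_hat = [0.5, 1.0, 1.5, 2.0, 3.0, 4.0]
sigma2_hat = dispersion_estimate(y, y_hat, k=1)
print(round(sigma2_hat, 4))
print(adjust_standard_errors([0.20, 0.05], sigma2_hat))
```

A value of $\hat{\sigma}^2$ above one inflates every standard error by the same factor, while a value below one shrinks them.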

Even (17.35) is not entirely general. Just as in the linear model, we can obtain standard errors 
for the Poisson QMLE that do not restrict the variance at all. [See Wooldridge (2010, Chapter 18) for 
further explanation.]

Under the Poisson distributional assumption, we can use the likelihood ratio statistic to test exclusion
restrictions, which, as always, has the form in (17.12). If we have $q$ exclusion restrictions, the statistic is
distributed approximately as $\chi^2_q$ under the null. Under the less restrictive assumption (17.35), a simple
adjustment is available (and then we call the statistic the quasi-likelihood ratio statistic): we divide (17.12) by
$\hat{\sigma}^2$, where $\hat{\sigma}^2$ is obtained from the unrestricted model.

GOING FURTHER 17.4
Suppose that we obtain $\hat{\sigma}^2 = 2$. How will the adjusted standard errors compare with the usual Poisson MLE
standard errors? How will the quasi-LR statistic compare with the usual LR statistic?

Poisson Regression for Number of Arrests 


We now apply the Poisson regression model to the arrest data in CRIME1, used, among other places, 
in Example 9.1. The dependent variable, narr86, is the number of times a man is arrested during 
1986. This variable is zero for 1,970 of the 2,725 men in the sample, and only eight values of narr86
are greater than five. Thus, a Poisson regression model is more appropriate than a linear regression 
model. Table 17.5 also presents the results of OLS estimation of a linear regression model. 

The standard errors for OLS are the usual ones; we could certainly have made these robust to heteroskedasticity.
The standard errors for Poisson regression are the usual maximum likelihood standard
errors. Because $\hat{\sigma} = 1.232$, the standard errors for Poisson regression should be inflated by this
factor (so each corrected standard error is about 23% higher). For example, a more reliable standard
error for tottime is $1.23(.015) \approx .0185$, which gives a $t$ statistic of about 1.3. The adjustment to the
standard errors reduces the significance of all variables, but several of them are still very statistically 
significant. 
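
The adjustment can be reproduced from the numbers in Table 17.5. This sketch covers three of the variables; the helper name `adjusted` is ours, and the inputs are the rounded table entries, so results match the text only up to rounding.

```python
sigma_hat = 1.232

# (coefficient, usual Poisson MLE standard error) from Table 17.5
poisson = {
    "pcnv": (-0.402, 0.085),
    "tottime": (0.024, 0.015),
    "black": (0.661, 0.074),
}

def adjusted(coef, se, sigma_hat):
    """Return the corrected standard error and the resulting t statistic."""
    se_adj = sigma_hat * se
    return se_adj, coef / se_adj

for name, (coef, se) in poisson.items():
    se_adj, t_stat = adjusted(coef, se, sigma_hat)
    print(f"{name}: adjusted se = {se_adj:.4f}, t = {t_stat:.2f}")
```

The $t$ statistics on pcnv and black remain well above conventional critical values in absolute value after the correction, while tottime stays insignificant.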

The OLS and Poisson coefficients are not directly comparable, and they have very different
meanings. For example, the OLS coefficient on pcnv implies that, if $\Delta pcnv = .10$, the expected number
of arrests falls by .013 (pcnv is the proportion of prior arrests that led to conviction). The Poisson
coefficient implies that $\Delta pcnv = .10$ reduces expected arrests by about 4% [$.402(.10) = .0402$, and
we multiply this by 100 to get the percentage effect]. As a policy matter, this suggests we can reduce
overall arrests by about 4% if we can increase the probability of conviction by .1.

The Poisson coefficient on black implies that, other factors being equal, the expected number 
of arrests for a black man is estimated to be about $100\cdot[\exp(.661) - 1] \approx 93.7\%$ higher than for a
white man with the same values for the other explanatory variables. 
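
Both percentage calculations use the Poisson coefficients from Table 17.5. The sketch below contrasts the log approximation with the exact proportionate change; the function names are ours.

```python
import math

def approx_pct_change(beta, dx=1.0):
    """Log approximation: 100 * beta * dx."""
    return 100.0 * beta * dx

def exact_pct_change(beta, dx=1.0):
    """Exact change in the exponential mean: 100 * [exp(beta * dx) - 1]."""
    return 100.0 * (math.exp(beta * dx) - 1.0)

# pcnv: a .10 increase, Poisson coefficient -.402
print(round(approx_pct_change(-0.402, 0.10), 2), round(exact_pct_change(-0.402, 0.10), 2))
# black: dummy variable switching from zero to one, Poisson coefficient .661
print(round(approx_pct_change(0.661), 1), round(exact_pct_change(0.661), 1))
```

For the small change in pcnv the two calculations nearly coincide, whereas for the black dummy the approximation of about 66% is far from the exact figure, which is why the exact formula is used for large coefficients on discrete variables.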

As with the Tobit application in Table 17.3, we report an R-squared for Poisson regression: the
squared correlation coefficient between $y_i$ and $\hat{y}_i = \exp(\hat{\beta}_0 + \hat{\beta}_1x_{i1} + \cdots + \hat{\beta}_kx_{ik})$. The motivation for
this goodness-of-fit measure is the same as for the Tobit model. We see that the exponential regression
model, estimated by Poisson QMLE, fits slightly better. Remember that the OLS estimates are chosen
to maximize the R-squared, but the Poisson estimates are not. (They are selected to maximize the
log-likelihood function.)


TABLE 17.5 Determinants of Number of Arrests for Young Men

Dependent Variable: narr86

Independent Variables    Linear (OLS)    Exponential (Poisson QMLE)
pcnv                     −.132           −.402
                         (.040)          (.085)
avgsen                   −.011           −.024
                         (.012)          (.020)
tottime                  .012            .024
                         (.009)          (.015)
ptime86                  −.041           −.099
                         (.009)          (.021)
qemp86                   −.051           −.038
                         (.014)          (.029)
inc86                    −.0015          −.0081
                         (.0003)         (.0010)
black                    .327            .661
                         (.045)          (.074)
hispan                   .194            .500
                         (.040)          (.074)
born60                   −.022           −.051
                         (.033)          (.064)
constant                 .577            −.600
                         (.038)          (.067)
Log-likelihood value                     −2,248.76
R-squared                .073            .077
σ̂                        .829            1.232


Other count data regression models have been proposed and used in applications, which generalize 
the Poisson distribution in a variety of ways. If we are interested in the effects of the $x_j$ on the mean
response, there is little reason to go beyond Poisson regression: it is simple, often gives good results, 
and has the robustness property discussed earlier. In fact, we could apply Poisson regression to a y 
that is a Tobit-like outcome, provided (17.31) holds. This might give good estimates of the mean 
effects. Extensions of Poisson regression are more useful when we are interested in estimating prob- 
abilities, such as P(y > 1|x). [See, for example, Cameron and Trivedi (1998).] 


17-4 Censored and Truncated Regression Models 


The models in Sections 17-1, 17-2, and 17-3 apply to various kinds of limited dependent variables 
that arise frequently in applied econometric work. In using these methods, it is important to remember 
that we use a probit or logit model for a binary response, a Tobit model for a corner solution out- 
come, or a Poisson regression model for a count response because we want models that account for 
important features of the distribution of y. There is no issue of data observability. For example, in the 
Tobit application to women’s labor supply in Example 17.2, there is no problem with observing hours 
worked: it is simply the case that a nontrivial fraction of married women in the population choose not 
to work for a wage. In the Poisson regression application to annual arrests, we observe the dependent 
variable for every young man in a random sample from the population, but the dependent variable can 
be zero as well as other small integer values.

Unfortunately, the distinction between lumpiness in an outcome variable (such as taking on the
value zero for a nontrivial fraction of the population) and problems of data censoring can be confus- 
ing. This is particularly true when applying the Tobit model. In this book, the standard Tobit model 
described in Section 17-2 is only for corner solution outcomes. But the literature on Tobit models 
usually treats another situation within the same framework: the response variable has been censored 
above or below some threshold. Typically, the censoring is due to survey design and, in some cases, 
institutional constraints. Rather than treat data censoring problems along with corner solution out- 
comes, we solve data censoring by applying a censored regression model. Essentially, the problem 
solved by a censored regression model is one of missing data on the response variable, y. Although we 
are able to randomly draw units from the population and obtain information on the explanatory vari- 
ables for all units, the outcome on y; is missing for some i. Still, we know whether the missing values 
are above or below a given threshold, and this knowledge provides useful information for estimating 
the parameters. 

A truncated regression model arises when we exclude, on the basis of y, a subset of the popula- 
tion in our sampling scheme. In other words, we do not have a random sample from the underlying 
population, but we know the rule that was used to include units in the sample. This rule is determined 
by whether y is above or below a certain threshold. We explain more fully the difference between cen- 
sored and truncated regression models later. 


17-4a Censored Regression Models 


While censored regression models can be defined without distributional assumptions, in this subsec- 
tion we study the censored normal regression model. The variable we would like to explain, y, 
follows the classical linear model. For emphasis, we put an $i$ subscript on a random draw from the
population:


$$y_i = \beta_0 + \mathbf{x}_i\boldsymbol{\beta} + u_i, \quad u_i|\mathbf{x}_i, c_i \sim \text{Normal}(0, \sigma^2)$$ [17.36]

$$w_i = \min(y_i, c_i).$$ [17.37]


Rather than observing $y_i$, we observe it only if it is less than a censoring value, $c_i$. Notice that (17.36)
includes the assumption that $u_i$ is independent of $c_i$. (For concreteness, we explicitly consider censoring
from above, or right censoring; the problem of censoring from below, or left censoring, is handled
similarly.)

One example of right data censoring is top coding. When a variable is top coded, we know its
value only up to a certain threshold. For responses greater than the threshold, we only know that
the variable is at least as large as the threshold. For example, in some surveys family wealth is top coded.
Suppose that respondents are asked their wealth, but people are allowed to respond with “more than
\$500,000.” Then, we observe actual wealth for those respondents whose wealth is less than \$500,000 but
not for those whose wealth is greater than \$500,000. In this case, the censoring threshold, $c_i$, is the same
for all $i$. In many situations, the censoring threshold changes with individual or family characteristics.

GOING FURTHER 17.5
Let $mvp_i$ be the marginal value product for worker $i$; this is the price of a firm’s good
multiplied by the marginal product of the worker. Assume $mvp_i$ is a linear function
of exogenous variables, such as education, experience, and so on, and an unobservable
error. Under perfect competition and without institutional constraints, each
worker is paid his or her marginal value product. Let $minwage_i$ denote the minimum
wage for worker $i$, which varies by state. We observe $wage_i$, which is the larger of $mvp_i$
and $minwage_i$. Write the appropriate model for the observed wage.

If we observed a random sample for $(\mathbf{x}, y)$, we would simply estimate $\boldsymbol{\beta}$ by OLS, and statistical
inference would be standard. (We again absorb the intercept into $\mathbf{x}$ for simplicity.) The censoring causes
problems. Using arguments similar to the Tobit model, an OLS regression using only the uncensored
observations (that is, those with $y_i < c_i$) produces inconsistent estimators of the $\beta_j$. An OLS regression
of $w_i$ on $\mathbf{x}_i$, using all observations, does not consistently estimate the $\beta_j$, unless there is no censoring.
This is similar to the Tobit case, but the problem is much different. In the Tobit model, we
are modeling economic behavior, which often yields zero outcomes; the Tobit model is supposed to
reflect this. With censored regression, we have a data collection problem because, for some reason,
the data are censored.

Under the assumptions in (17.36) and (17.37), we can estimate $\beta$ (and $\sigma^2$) by maximum 
likelihood, given a random sample on $(x_i, w_i)$. For this, we need the density of $w_i$ given $(x_i, c_i)$. For 
uncensored observations, $w_i = y_i$ and the density of $w_i$ is the same as that for $y_i$: Normal$(x_i\beta, \sigma^2)$. 
For censored observations, we need the probability that $w_i$ equals the censoring value, $c_i$, given $x_i$: 


$$P(w_i = c_i \mid x_i) = P(y_i \ge c_i \mid x_i) = P(u_i \ge c_i - x_i\beta) = 1 - \Phi[(c_i - x_i\beta)/\sigma].$$


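We can combine these two parts to obtain the density of $w_i$ given $x_i$ and $c_i$: 


$$f(w \mid x_i, c_i) = 1 - \Phi[(c_i - x_i\beta)/\sigma], \quad w = c_i, \qquad [17.38]$$


$$\qquad\qquad\quad\; = (1/\sigma)\,\phi[(w - x_i\beta)/\sigma], \quad w < c_i. \qquad [17.39]$$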


The log-likelihood for observation $i$ is obtained by taking the natural log of the density for each $i$. We 
can maximize the sum of these across $i$, with respect to the $\beta_j$ and $\sigma$, to obtain the MLEs. 
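
As an illustration only, the log-likelihood built from (17.38) and (17.39) can be sketched in Python. This is a minimal sketch, not the routine any particular econometrics package uses: the function name `censored_negloglik`, the simulated data, and the parameter values (intercept 1.0, slope 0.5, $\sigma = 1$, threshold 1.5) are all assumptions made for the example, and it relies on NumPy and SciPy being available.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm


def censored_negloglik(params, w, X, c):
    """Negative log-likelihood for normal regression censored from above.

    params holds the regression coefficients followed by log(sigma);
    working with log(sigma) keeps sigma positive during optimization.
    """
    beta, sigma = params[:-1], np.exp(params[-1])
    xb = X @ beta
    censored = w >= c
    # Uncensored observations: log of (1/sigma) * phi((w - xb)/sigma), as in (17.39).
    ll_unc = norm.logpdf((w - xb) / sigma) - np.log(sigma)
    # Censored observations: log of 1 - Phi((c - xb)/sigma), as in (17.38).
    ll_cen = norm.logsf((c - xb) / sigma)
    return -np.sum(np.where(censored, ll_cen, ll_unc))


# Simulated data (hypothetical values chosen for the illustration).
rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])   # intercept absorbed into X
y = 1.0 + 0.5 * x + rng.normal(size=n)
c = np.full(n, 1.5)                    # same threshold for all i, as with top coding
w = np.minimum(y, c)                   # observed outcome: w = min(y, c)

# OLS of w on x using all observations, also used as the starting value.
beta_ols = np.linalg.lstsq(X, w, rcond=None)[0]
start = np.append(beta_ols, 0.0)
fit = minimize(censored_negloglik, start, args=(w, X, c), method="BFGS")
beta_mle, sigma_mle = fit.x[:-1], np.exp(fit.x[-1])

print("censored MLE:", beta_mle, sigma_mle)
print("OLS treating censored values as actual:", beta_ols)
```

In a simulation like this one, the OLS slope that treats the censored values as if they were actual outcomes should come out noticeably smaller than the MLE slope, which is the pattern described in the text.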

It is important to know that we can interpret the $\beta_j$ just as in a linear regression model under 
random sampling. This is much different than Tobit applications to corner solution responses, where the 
expectations of interest are nonlinear functions of the $\beta_j$. 

An important application of censored regression models is duration analysis. A duration is a 
variable that measures the time before a certain event occurs. For example, we might wish to explain 
the number of days before a felon released from prison is arrested. For some felons, this may never 
happen, or it may happen after such a long time that we must censor the duration in order to analyze 
the data. 

In duration applications of censored normal regression, as well as in top coding, we often use the 
natural log as the dependent variable, which means we also take the log of the censoring threshold in 
(17.37). As we have seen throughout this text, using the log transformation for the dependent variable 
causes the parameters to be interpreted as percentage changes. Further, as with many positive variables, 
the log of a duration typically has a distribution closer to (conditional) normal than the duration itself. 


EXAMPLE 17.4 Duration of Recidivism 


The file RECID contains data on the time in months until an inmate in a North Carolina prison is 
arrested after being released from prison; call this durat. Some inmates participated in a work pro- 
gram while in prison. We also control for a variety of demographic variables, as well as for measures 
of prison and criminal history. 

Of 1,445 inmates, 893 had not been arrested during the period they were followed; therefore, 
these observations are censored. The censoring times differed among inmates, ranging from 70 to 81 
months. 

Table 17.6 gives the results of censored normal regression for log(durat). Each of the coeffi- 
cients, when multiplied by 100, gives the estimated percentage change in expected duration, given a 
ceteris paribus increase of one unit in the corresponding explanatory variable. 

Several of the coefficients in Table 17.6 are interesting. The variables priors (number of prior 
convictions) and tserved (total months spent in prison) have negative effects on the time until the 
next arrest occurs. This suggests that these variables measure proclivity for criminal activity rather 
than representing a deterrent effect. For example, an inmate with one more prior conviction has a 
duration until next arrest that is almost 14% less. A year of time served reduces duration by about 
$100 \cdot 12(.019) = 22.8\%$. A somewhat surprising finding is that a man serving time for a felony has an 
estimated expected duration that is almost 56% [$\exp(.444) - 1 \approx .56$] longer than a man serving 
time for a nonfelony. 


TABLE 17.6 Censored Regression Estimation of Criminal Recidivism 

Dependent Variable: log(durat) 

Independent Variables     Coefficient 
                          (Standard Error) 
workprg                   -.063 
                          (.120) 
priors                    -.137 
                          (.021) 
tserved                   -.019 
                          (.003) 
felon                     .444 
                          (.145) 
alcohol                   -.635 
                          (.144) 
drugs                     -.298 
                          (.133) 
black                     -.543 
                          (.117) 
married                   .341 
                          (.140) 
educ                      .023 
                          (.025) 
age                       .0039 
                          (.0006) 
constant                  4.099 
                          (.348) 

Log-likelihood value      -1,597.06 
$\hat{\sigma}$            1.810 

Those with a history of drug or alcohol abuse have substantially shorter expected durations until 
the next arrest. (The variables alcohol and drugs are binary variables.) Older men, and men who were 
married at the time of incarceration, are expected to have significantly longer durations until their next 
arrest. Black men have substantially shorter durations, on the order of 42% [$\exp(-.543) - 1 \approx -.42$]. 
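
The percentage effects quoted in this example can be checked with a few lines of arithmetic. The sketch below is only a convenience calculation; the helper name `exact_pct_change` is made up for the illustration, and the coefficients are the ones reported in Table 17.6.

```python
import math


def exact_pct_change(coef):
    """Exact percentage change in the level of y implied by a coefficient in a log(y) equation."""
    return 100.0 * (math.exp(coef) - 1.0)


felon = exact_pct_change(0.444)     # felony versus nonfelony
black = exact_pct_change(-0.543)    # coefficient on black
tserved_year = 100.0 * 12 * 0.019   # approximate effect (in absolute value) of 12 more months served

print(round(felon, 1), round(black, 1), round(tserved_year, 1))
```

For binary regressors with large coefficients, the exact calculation $100[\exp(\hat{\beta}_j) - 1]$ can differ noticeably from the approximation $100\hat{\beta}_j$, which is why the text reports the exponentiated values for felon and black.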

The key policy variable, workprg, does not have the desired effect. The point estimate is that, 
other things being equal, men who participated in the work program have estimated recidivism dura- 
tions that are about 6.3% shorter than men who did not participate. The coefficient has a small t 
statistic, so we would probably conclude that the work program has no effect. This could be due to 
a self-selection problem, or it could be a product of the way men were assigned to the program. Of 
course, it may simply be that the program was ineffective. 


In this example, it is crucial to account for the censoring, especially because almost 62% of the 
durations are censored. If we apply straight OLS to the entire sample and treat the censored durations 
as if they were uncensored, the coefficient estimates are markedly different. In fact, they are all shrunk 
toward zero. For example, the coefficient on priors becomes —.059 (se = .009), and that on alcohol 
becomes —.262 (se = .060). Although the directions of the effects are the same, the importance of 
these variables is greatly diminished. The censored regression estimates are much more reliable. 
There are other ways of measuring the effects of each of the explanatory variables in Table 17.6 
on the duration, rather than focusing only on the expected duration. A treatment of modern duration 
analysis is beyond the scope of this text. [For an introduction, see Wooldridge (2010, Chapter 22).] 

If any of the assumptions of the censored normal regression model are violated—in particular, 
if there is heteroskedasticity or nonnormality in u—the MLEs are generally inconsistent. This shows 
that the censoring is potentially very costly, as OLS using an uncensored sample requires neither 
normality nor homoskedasticity for consistency. There are methods that do not require us to assume a 
distribution, but they are more advanced. [See Wooldridge (2010, Chapter 19).] 


17-4b Truncated Regression Models 


The truncated regression model differs in an important respect from the censored regression model. In 
the case of data censoring, we do randomly sample units from the population. The censoring problem 
is that, while we always observe the explanatory variables for each randomly drawn unit, we observe 
the outcome on y only when it is not censored above or below a given threshold. With data truncation, 
we restrict attention to a subset of the population prior to sampling; there is a part of the population 
for which we observe no information. In particular, we have no information on explanatory vari- 
ables. The truncated sampling scenario typically arises when a survey targets a particular subset of the 
population and, perhaps due to cost considerations, entirely ignores the other part of the population. 
Subsequently, researchers might want to use the truncated sample to answer questions about the entire 
population, but one must recognize that the sampling scheme did not generate a random sample from 
the whole population. 

As an example, Hausman and Wise (1977) used data from a negative income tax experiment to 
study various determinants of earnings. To be included in the study, a family had to have income less 
than 1.5 times the 1967 poverty line, where the poverty line depended on family size. Hausman and 
Wise wanted to use the data to estimate an earnings equation for the entire population. 

The truncated normal regression model begins with an underlying population model that satis- 
fies the classical linear model assumptions: 


$$y = \beta_0 + x\beta + u, \quad u \mid x \sim \text{Normal}(0, \sigma^2). \qquad [17.40]$$


Recall that this is a strong set of assumptions, because u must not only be independent of x, but also 
normally distributed. We focus on this model because relaxing the assumptions is difficult. 

Under (17.40) we know that, given a random sample from the population, OLS is the most efficient 
estimation procedure. The problem arises because we do not observe a random sample from 
the population: Assumption MLR.2 is violated. In particular, a random draw $(x_i, y_i)$ is observed 
only if $y_i \le c_i$, where $c_i$ is the truncation threshold that can depend on exogenous variables, in 
particular, the $x_i$. (In the Hausman and Wise example, $c_i$ depends on family size.) This means that, if 
$\{(x_i, y_i): i = 1, \ldots, n\}$ is our observed sample, then $y_i$ is necessarily less than or equal to $c_i$. This 
differs from the censored regression model: in a censored regression model, we observe $x_i$ for any 
randomly drawn observation from the population; in the truncated model, we only observe $x_i$ if $y_i \le c_i$. 

To estimate the $\beta_j$ (along with $\sigma$), we need the distribution of $y_i$ given that $y_i \le c_i$ and $x_i$. This is 
written as 


$$g(y \mid x_i, c_i) = \frac{f(y \mid x_i\beta, \sigma^2)}{F(c_i \mid x_i\beta, \sigma^2)}, \quad y \le c_i, \qquad [17.41]$$


where $f(y \mid x_i\beta, \sigma^2)$ denotes the normal density with mean $\beta_0 + x_i\beta$ and variance $\sigma^2$, and 
$F(c_i \mid x_i\beta, \sigma^2)$ is the normal cdf with the same mean and variance, evaluated at $c_i$. This expression 
for the density, conditional on $y_i \le c_i$, makes intuitive sense: it is the population density for $y$, given $x$, 
divided by the probability that $y_i$ is less than or equal to $c_i$ (given $x_i$), $P(y_i \le c_i \mid x_i)$. In effect, we 
renormalize the density by dividing by the area under $f(\cdot \mid x_i\beta, \sigma^2)$ that is to the left of $c_i$. 

If we take the log of (17.41), sum across all $i$, and maximize the result with respect to the 
$\beta_j$ and $\sigma^2$, we obtain the maximum likelihood estimators. This leads to consistent, approximately 
normal estimators. The inference, including standard errors and log-likelihood statistics, is standard 
and treated in Wooldridge (2010, Chapter 19). 
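
A minimal Python sketch of this estimator follows, under stated assumptions: the data are simulated, the truncation point (1.5) and true parameters (intercept 1.0, slope 0.5, $\sigma = 1$) are hypothetical, and the function name `truncated_negloglik` is invented for the illustration. It maximizes the sum of the logs of (17.41) and compares the result with OLS on the truncated sample.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm


def truncated_negloglik(params, y, X, c):
    """Negative log-likelihood for normal regression truncated from above at c."""
    beta, sigma = params[:-1], np.exp(params[-1])   # params[-1] is log(sigma)
    xb = X @ beta
    # log f(y | x) minus log F(c | x): the renormalized density in (17.41).
    ll = (norm.logpdf((y - xb) / sigma) - np.log(sigma)
          - norm.logcdf((c - xb) / sigma))
    return -np.sum(ll)


# Simulate a population, then keep only units with y <= 1.5 (hypothetical values).
rng = np.random.default_rng(1)
n_pop = 20000
x = rng.normal(size=n_pop)
y = 1.0 + 0.5 * x + rng.normal(size=n_pop)
keep = y <= 1.5                         # units with y > 1.5 are never sampled

ys = y[keep]
Xs = np.column_stack([np.ones(ys.size), x[keep]])
cs = np.full(ys.size, 1.5)

# OLS on the truncated sample (also the starting value for the MLE).
beta_ols = np.linalg.lstsq(Xs, ys, rcond=None)[0]
start = np.append(beta_ols, np.log(ys.std()))
fit = minimize(truncated_negloglik, start, args=(ys, Xs, cs), method="BFGS")
beta_mle, sigma_mle = fit.x[:-1], np.exp(fit.x[-1])

print("OLS on truncated sample:", beta_ols)
print("truncated normal MLE:   ", beta_mle, sigma_mle)
```

In this setup the OLS slope from the truncated sample should be well below 0.5, while the truncated normal MLE should be close to it, provided the homoskedastic normal assumption in (17.40) holds, as it does by construction here.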

We could analyze the data from Example 17.4 as a truncated sample if we drop all data on an 
observation whenever it is censored. This would give us 552 observations from a truncated normal 
distribution, where the truncation point differs across i. However, we would never analyze duration 
data (or top-coded data) in this way, as it eliminates useful information. The fact that we know a lower 
bound for 893 durations, along with the explanatory variables, is useful information; censored regres- 
sion uses this information, while truncated regression does not. 

A better example of truncated regression is given in Hausman and Wise (1977), where they 
emphasize that OLS applied to a sample truncated from above generally produces estimators biased 
toward zero. Intuitively, this makes sense. Suppose that the relationship of interest is between income 
and education levels. If we only observe people whose income is below a certain threshold, we are 
lopping off the upper end. This tends to flatten the estimated line relative to the true regression line 
in the whole population. Figure 17.4 illustrates the problem when income is truncated from above 
at $50,000. Although we observe the data points represented by the open circles, we do not observe 
the data points represented by the darkened circles. A regression analysis using the truncated sample 
does not lead to consistent estimators. Incidentally, if the sample in Figure 17.4 were censored rather 
than truncated—that is, we had top-coded data—we would observe education levels for all points in 
Figure 17.4, but for individuals with incomes above $50,000 we would not know the exact income 
amount. We would only know that income was at least $50,000. In effect, all observations represented 
by the darkened circles would be brought down to the horizontal line at income = 50. 

As with censored regression, if the underlying homoskedastic normal assumption in (17.40) is 
violated, the truncated normal MLE is biased and inconsistent. Methods that do not require these 
assumptions are available; see Wooldridge (2010, Chapter 19) for discussion and references. 


FIGURE 17.4 A true, or population, regression line and the incorrect regression line for 
the truncated population with observed incomes below $50,000. 


17-5 Sample Selection Corrections 


Truncated regression is a special case of a general problem known as nonrandom sample selection. 
But survey design is not the only cause of nonrandom sample selection. Often, respondents fail to 
provide answers to certain questions, which leads to missing data for the dependent or independent 
variables. Because we cannot use these observations in our estimation, we should wonder whether 
dropping them leads to bias in our estimators. 

Another general example is usually called incidental truncation. Here, we do not observe y 
because of the outcome of another variable. The leading example is estimating the so-called wage 
offer function from labor economics. Interest lies in how various factors, such as education, affect 
the wage an individual could earn in the labor force. For people who are in the workforce, we 
observe the wage offer as the current wage. But for those currently out of the workforce, we do not 
observe the wage offer. Because working may be systematically correlated with unobservables that 
affect the wage offer, using only working people—as we have in all wage examples so far—might 
produce biased estimators of the parameters in the wage offer equation. 

Nonrandom sample selection can also arise when we have panel data. In the simplest case, we 
have two years of data, but, due to attrition, some people leave the sample. This is particularly a prob- 
lem in policy analysis, where attrition may be related to the effectiveness of a program. 


17-5a When Is OLS on the Selected Sample Consistent? 


In Section 9-5, we provided a brief discussion of the kinds of sample selection that can be ignored. 
The key distinction is between exogenous and endogenous sample selection. In the truncated Tobit 
case, we clearly have endogenous sample selection, and OLS is biased and inconsistent. On the other 
hand, if our sample is determined solely by an exogenous explanatory variable, we have exogenous 
sample selection. Cases between these extremes are less clear, and we now provide careful definitions 
and assumptions for them. The population model is 


$$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k + u, \quad E(u \mid x_1, x_2, \ldots, x_k) = 0. \qquad [17.42]$$

It is useful to write the population model for a random draw as 

$$y_i = x_i\beta + u_i, \qquad [17.43]$$


where we use $x_i\beta$ as shorthand for $\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik}$. Now, let $n$ be the size of a 
random sample from the population. If we could observe $y_i$ and each $x_{ij}$ for all $i$, we would simply use 
OLS. Assume that, for some reason, either $y_i$ or some of the independent variables are not observed 
for certain $i$. For at least some observations, we observe the full set of variables. Define a selection 
indicator $s_i$ for each $i$ by $s_i = 1$ if we observe all of $(y_i, x_i)$, and $s_i = 0$ otherwise. Thus, $s_i = 1$ 
indicates that we will use the observation in our analysis; $s_i = 0$ means the observation will not be used. 
We are interested in the statistical properties of the OLS estimators using the selected sample, that is, 
using observations for which $s_i = 1$. Therefore, we use fewer than $n$ observations, say, $n_1$. 

It turns out to be easy to obtain conditions under which OLS is consistent (and even unbiased). 
Effectively, rather than estimating (17.43), we can only estimate the equation 


$$s_i y_i = s_i x_i\beta + s_i u_i. \qquad [17.44]$$


When $s_i = 1$, we simply have (17.43); when $s_i = 0$, we simply have $0 = 0 + 0$, which clearly tells us 
nothing about $\beta$. Regressing $s_i y_i$ on $s_i x_i$ for $i = 1, 2, \ldots, n$ is the same as regressing $y_i$ on $x_i$ using the 
observations for which $s_i = 1$. Thus, we can learn about the consistency of the $\hat{\beta}_j$ by studying (17.44) 
on a random sample. 

From our analysis in Chapter 5, the OLS estimators from (17.44) are consistent if the error term 
has zero mean and is uncorrelated with each explanatory variable. In the population, the zero mean 
assumption is $E(su) = 0$, and the zero correlation assumptions can be stated as 


$$E[(sx_j)(su)] = E(sx_ju) = 0, \qquad [17.45]$$

where $s$, $x_j$, and $u$ are random variables representing the population; we have used the fact that $s^2 = s$ 
because $s$ is a binary variable. Condition (17.45) is different from what we need if we observe all 
variables for a random sample: $E(x_ju) = 0$. Therefore, in the population, we need $u$ to be uncorrelated 
with $sx_j$. 

The key condition for unbiasedness is $E(su \mid sx_1, \ldots, sx_k) = 0$. As usual, this is a stronger 
assumption than that needed for consistency. 

If $s$ is a function only of the explanatory variables, then $sx_j$ is just a function of $x_1, x_2, \ldots, x_k$; 
by the conditional mean assumption in (17.42), $sx_j$ is also uncorrelated with $u$. In fact, 
$E(su \mid sx_1, \ldots, sx_k) = sE(u \mid sx_1, \ldots, sx_k) = 0$, because $E(u \mid x_1, \ldots, x_k) = 0$. This is the case of 
exogenous sample selection, where $s_i = 1$ is determined entirely by $x_{i1}, \ldots, x_{ik}$. As an example, if 
we are estimating a wage equation where the explanatory variables are education, experience, tenure, 
gender, marital status, and so on (which are assumed to be exogenous), we can select the sample on 
the basis of any or all of the explanatory variables. 

If sample selection is entirely random in the sense that $s_i$ is independent of $(x_i, u_i)$, then 
$E(sx_ju) = E(s)E(x_ju) = 0$, because $E(x_ju) = 0$ under (17.42). Therefore, if we begin with a random 
sample and randomly drop observations, OLS is still consistent. In fact, OLS is again unbiased in this 
case, provided there is no perfect multicollinearity in the selected sample. 

If $s$ depends on the explanatory variables and additional random terms that are independent of 
$x$ and $u$, OLS is also consistent and unbiased. For example, suppose that IQ score is an explanatory 
variable in a wage equation, but IQ is missing for some people. Suppose we think that selection can 
be described by $s = 1$ if $IQ \ge v$, and $s = 0$ if $IQ < v$, where $v$ is an unobserved random variable that 
is independent of $IQ$, $u$, and the other explanatory variables. This means that we are more likely to 
observe an $IQ$ that is high, but there is always some chance of not observing any $IQ$. Conditional on the 
explanatory variables, $s$ is independent of $u$, which means that $E(u \mid x_1, \ldots, x_k, s) = E(u \mid x_1, \ldots, x_k)$, 
and the last expectation is zero by assumption on the population model. If we add the homoskedasticity 
assumption $E(u^2 \mid x, s) = E(u^2) = \sigma^2$, then the usual OLS standard errors and test statistics are valid. 

So far, we have shown several situations where OLS on the selected sample is unbiased, or at 
least consistent. When is OLS on the selected sample inconsistent? We already saw one example: 
regression using a truncated sample. When the truncation is from above, $s_i = 1$ if $y_i \le c_i$, where $c_i$ is 
the truncation threshold. Equivalently, $s_i = 1$ if $u_i \le c_i - x_i\beta$. Because $s_i$ depends directly on $u_i$, $s_i$ 
and $u_i$ will not be uncorrelated, even conditional on $x_i$. This is why OLS on the selected sample does 
not consistently estimate the $\beta_j$. There are less obvious ways that $s$ and $u$ can be correlated; we 
consider this in the next subsection. 
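
The three cases discussed above (selection on $x$ only, purely random selection, and truncation on $y$) can be compared in a small simulation. This is a sketch under assumptions: the population model, the selection rules, and the helper `ols_slope` are all made up for the illustration, and only NumPy is required.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50000
x = rng.normal(size=n)
u = rng.normal(size=n)
y = 1.0 + 0.5 * x + u                 # hypothetical population model, true slope 0.5


def ols_slope(x, y):
    """Slope from an OLS regression of y on a constant and x."""
    X = np.column_stack([np.ones(x.size), x])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]


s_exog = x > 0                        # selection determined entirely by x
s_rand = rng.random(n) < 0.5          # selection independent of (x, u)
s_endog = y <= 1.5                    # truncation from above: s depends directly on u

slope_exog = ols_slope(x[s_exog], y[s_exog])
slope_rand = ols_slope(x[s_rand], y[s_rand])
slope_endog = ols_slope(x[s_endog], y[s_endog])

print("exogenous selection: ", slope_exog)
print("random selection:    ", slope_rand)
print("endogenous selection:", slope_endog)
```

With a sample this large, the first two slopes should be close to 0.5, while the slope from the truncated sample should be pulled toward zero.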

The results on consistency of OLS extend to instrumental variables estimation. If the IVs are 
denoted $z_h$ in the population, the key condition for consistency of 2SLS is $E(sz_hu) = 0$, which holds 
if $E(u \mid z, s) = 0$. Therefore, if selection is determined entirely by the exogenous variables $z$, or if $s$ 
depends on other factors that are independent of $u$ and $z$, then 2SLS on the selected sample is 
generally consistent. We do need to assume that the explanatory and instrumental variables are 
appropriately correlated in the selected part of the population. Wooldridge (2010, Chapter 19) contains precise 
statements of these assumptions. 

It can also be shown that, when selection is entirely a function of the exogenous variables, MLE 
of a nonlinear model—such as a logit or probit model—produces consistent, asymptotically normal 
estimators, and the usual standard errors and test statistics are valid. [Again, see Wooldridge (2010, 
Chapter 19).] 


17-5b Incidental Truncation 


As we mentioned earlier, a common form of sample selection is called incidental truncation. We 
again start with the population model in (17.42). However, we assume that we will always observe 
the explanatory variables $x_j$. The problem is, we only observe $y$ for a subset of the population. The rule 
determining whether we observe $y$ does not depend directly on the outcome of $y$. A leading example 
is when $y = \log(wage^o)$, where $wage^o$ is the wage offer, or the hourly wage that an individual could 
receive in the labor market. If the person is actually working at the time of the survey, then we observe 
the wage offer because we assume it is the observed wage. But for people out of the workforce, we 
cannot observe $wage^o$. Therefore, the truncation of wage offer is incidental because it depends on 
another variable, namely, labor force participation. Importantly, we would generally observe all other 
information about an individual, such as education, prior experience, gender, marital status, and so on. 

The usual approach to incidental truncation is to add an explicit selection equation to the 
population model of interest: 


$$y = x\beta + u, \quad E(u \mid x) = 0 \qquad [17.46]$$

$$s = 1[z\gamma + v \ge 0], \qquad [17.47]$$


where $s = 1$ if we observe $y$, and zero otherwise. We assume that elements of $x$ and $z$ are always 
observed, and we write $x\beta = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k$ and $z\gamma = \gamma_0 + \gamma_1 z_1 + \cdots + \gamma_m z_m$. 

The equation of primary interest is (17.46), and we could estimate $\beta$ by OLS given a random 
sample. The selection equation, (17.47), depends on observed variables, $z_h$, and an unobserved error, $v$. 
A standard assumption, which we will make, is that $z$ is exogenous in (17.46): 


$$E(u \mid x, z) = 0.$$


In fact, for the following proposed methods to work well, we will require that $x$ be a strict subset of $z$: 
any $x_j$ is also an element of $z$, and we have some elements of $z$ that are not also in $x$. We will see later 
why this is crucial. 

The error term $v$ in the sample selection equation is assumed to be independent of $z$ (and 
therefore $x$). We also assume that $v$ has a standard normal distribution. We can easily see that correlation 
between $u$ and $v$ generally causes a sample selection problem. To see why, assume that $(u, v)$ is 
independent of $z$. Then, taking the expectation of (17.46), conditional on $z$ and $v$, and using the fact that $x$ 
is a subset of $z$ gives 


$$E(y \mid z, v) = x\beta + E(u \mid z, v) = x\beta + E(u \mid v),$$


where $E(u \mid z, v) = E(u \mid v)$ because $(u, v)$ is independent of $z$. Now, if $u$ and $v$ are jointly normal (with 
zero mean), then $E(u \mid v) = \rho v$ for some parameter $\rho$. Therefore, 


$$E(y \mid z, v) = x\beta + \rho v.$$


We do not observe $v$, but we can use this equation to compute $E(y \mid z, s)$ and then specialize this to 
$s = 1$. We now have: 


$$E(y \mid z, s) = x\beta + \rho E(v \mid z, s).$$


Because $s$ and $v$ are related by (17.47), and $v$ has a standard normal distribution, we can show that 
$E(v \mid z, s)$ is simply the inverse Mills ratio, $\lambda(z\gamma)$, when $s = 1$. This leads to the important equation 


$$E(y \mid z, s = 1) = x\beta + \rho\lambda(z\gamma). \qquad [17.48]$$


Equation (17.48) shows that the expected value of $y$, given $z$ and observability of $y$, is equal to $x\beta$, 
plus an additional term that depends on the inverse Mills ratio evaluated at $z\gamma$. Remember, we hope 
to estimate $\beta$. This equation shows that we can do so using only the selected sample, provided we 
include the term $\lambda(z\gamma)$ as an additional regressor. 
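
The step from $E(v \mid z, s)$ to the inverse Mills ratio can be worked out explicitly. The following derivation is a sketch that uses only the assumptions already stated: $v$ is standard normal and independent of $z$, $s = 1[z\gamma + v \ge 0]$, and $\lambda(c) = \phi(c)/\Phi(c)$.

```latex
% Assumes v ~ Normal(0,1), independent of z, and s = 1[z\gamma + v \ge 0].
\begin{align*}
E(v \mid z, s = 1) &= E(v \mid v \ge -z\gamma) \\
  &= \frac{\int_{-z\gamma}^{\infty} v\,\phi(v)\,dv}{P(v \ge -z\gamma)} \\
  &= \frac{\phi(-z\gamma)}{1 - \Phi(-z\gamma)}
     \qquad \text{because } \tfrac{d}{dv}\phi(v) = -v\,\phi(v) \\
  &= \frac{\phi(z\gamma)}{\Phi(z\gamma)} \equiv \lambda(z\gamma),
\end{align*}
% The last line uses the symmetry of the standard normal distribution:
% \phi(-c) = \phi(c) and 1 - \Phi(-c) = \Phi(c).
```

Substituting this into $E(y \mid z, s) = x\beta + \rho E(v \mid z, s)$ at $s = 1$ gives (17.48).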

If $\rho = 0$, $\lambda(z\gamma)$ does not appear, and OLS of $y$ on $x$ using the selected sample consistently 
estimates $\beta$. Otherwise, we have effectively omitted a variable, $\lambda(z\gamma)$, which is generally correlated with $x$. 
When does $\rho = 0$? The answer is when $u$ and $v$ are uncorrelated. 

Because $\gamma$ is unknown, we cannot evaluate $\lambda(z_i\gamma)$ for each $i$. However, from the assumptions we 
have made, $s$ given $z$ follows a probit model: 


$$P(s = 1 \mid z) = \Phi(z\gamma). \qquad [17.49]$$

Therefore, we can estimate $\gamma$ by probit of $s_i$ on $z_i$, using the entire sample. In a second step, we can 
estimate $\beta$. We summarize the procedure, which has recently been dubbed the Heckit method in 
econometrics literature after the work of Heckman (1976). 


Sample Selection Correction: 

(i) Using all $n$ observations, estimate a probit model of $s_i$ on $z_i$ and obtain the estimates $\hat{\gamma}_h$. 
Compute the inverse Mills ratio, $\hat{\lambda}_i = \lambda(z_i\hat{\gamma})$, for each $i$. (Actually, we need these only for the $i$ with 
$s_i = 1$.) 

(ii) Using the selected sample, that is, the observations for which $s_i = 1$ (say, $n_1$ of them), run the 
regression of 


$$y_i \text{ on } x_i, \hat{\lambda}_i. \qquad [17.50]$$


The $\hat{\beta}_j$ are consistent and approximately normally distributed. 
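
The two steps above can be sketched in Python as follows. This is an illustrative implementation under assumptions, not the routine used by any particular package: the function names (`probit_fit`, `inverse_mills`, `heckit`) are invented here, the data are simulated with a hypothetical $\rho = 0.6$, and no corrected standard errors are computed.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm


def probit_fit(s, Z):
    """Probit MLE of the 0/1 indicator s on Z (Z must include a constant)."""
    q = 2.0 * s - 1.0                       # +1 when s = 1, -1 when s = 0

    def negloglik(gamma):
        return -np.sum(norm.logcdf(q * (Z @ gamma)))

    return minimize(negloglik, np.zeros(Z.shape[1]), method="BFGS").x


def inverse_mills(a):
    """lambda(a) = phi(a) / Phi(a), computed on the log scale for stability."""
    return np.exp(norm.logpdf(a) - norm.logcdf(a))


def heckit(y, X, Z, s):
    """Two-step estimates: (beta_hat, coefficient on lambda_hat, gamma_hat)."""
    gamma_hat = probit_fit(s, Z)            # step (i): probit using all n observations
    lam = inverse_mills(Z @ gamma_hat)
    sel = s == 1
    W = np.column_stack([X[sel], lam[sel]])  # step (ii): y on x and lambda_hat
    coef = np.linalg.lstsq(W, y[sel], rcond=None)[0]
    return coef[:-1], coef[-1], gamma_hat


# Simulated data (all parameter values are hypothetical).
rng = np.random.default_rng(3)
n = 20000
x1 = rng.normal(size=n)
z2 = rng.normal(size=n)                     # affects selection but not y
v = rng.normal(size=n)
u = 0.6 * v + 0.8 * rng.normal(size=n)      # E(u | v) = 0.6 v, so rho = 0.6
X = np.column_stack([np.ones(n), x1])
Z = np.column_stack([np.ones(n), x1, z2])   # x is a strict subset of z
s = (Z @ np.array([0.5, 0.5, 1.0]) + v >= 0).astype(float)
y = np.where(s == 1, 1.0 + 0.5 * x1 + u, np.nan)   # y is unobserved when s = 0

beta_hat, rho_hat, gamma_hat = heckit(y, X, Z, s)
sel = s == 1
beta_ols = np.linalg.lstsq(X[sel], y[sel], rcond=None)[0]

print("Heckit beta:", beta_hat, "coefficient on lambda_hat:", rho_hat)
print("OLS on selected sample:", beta_ols)
```

Because $\rho > 0$ in this simulation, OLS on the selected sample should overstate the intercept, while the two-step estimates should be close to the values used to generate the data. Any standard errors computed from the second-step regression in the usual way would be subject to the caveat discussed in the text when $\rho \neq 0$.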

A simple test of selection bias is available from regression (17.50). Namely, we can use the usual 
$t$ statistic on $\hat{\lambda}_i$ as a test of $H_0$: $\rho = 0$. Under $H_0$, there is no sample selection problem. 

When $\rho \ne 0$, the usual OLS standard errors reported from (17.50) are not correct. This is because 
they do not account for estimation of $\gamma$, which uses the same observations in regression (17.50), and 
more. Some econometrics packages compute corrected standard errors. [Unfortunately, it is not as 
simple as a heteroskedasticity adjustment. See Wooldridge (2010, Chapter 6) for further discussion.] 
In many cases, the adjustments do not lead to important differences, but it is hard to know that 
beforehand (unless $\hat{\rho}$ is small and insignificant). 

We recently mentioned that x should be a strict subset of z. This has two implications. First, any 
element that appears as an explanatory variable in (17.46) should also be an explanatory variable in 
the selection equation. Although in rare cases it makes sense to exclude elements from the selection 
equation, including all elements of x in z is not very costly; excluding them can lead to inconsistency 
if they are incorrectly excluded. 

A second major implication is that we have at least one element of $z$ that is not also in $x$. This 
means that we need a variable that affects selection but does not have a partial effect on $y$. This is not 
absolutely necessary to apply the procedure (in fact, we can mechanically carry out the two steps 
when $z = x$), but the results are usually less than convincing unless we have an exclusion restriction 
in (17.46). The reason for this is that while the inverse Mills ratio is a nonlinear function of $z$, it is 
often well approximated by a linear function. If $z = x$, $\hat{\lambda}_i$ can be highly correlated with the elements 
of $x_i$. As we know, such multicollinearity can lead to very high standard errors for the $\hat{\beta}_j$. Intuitively, if 
we do not have a variable that affects selection but not $y$, it is extremely difficult, if not impossible, to 
distinguish sample selection from a misspecified functional form in (17.46). 


EXAMPLE 17.5 Wage Offer Equation for Married Women 


We apply the sample selection correction to the data on married women in MROZ. Recall that of 
the 753 women in the sample, 428 worked for a wage during the year. The wage offer equation is 
standard, with log(wage) as the dependent variable and educ, exper, and $exper^2$ as the explanatory 
variables. In order to test and correct for sample selection bias (due to unobservability of the wage 
offer for nonworking women), we need to estimate a probit model for labor force participation. In 
addition to the education and experience variables, we include the factors in Table 17.1: other income, 
age, number of young children, and number of older children. The fact that these four variables are 
excluded from the wage offer equation is an assumption: we assume that, given the productivity 
factors, nwifeinc, age, kidslt6, and kidsge6 have no effect on the wage offer. It is clear from the probit 
results in Table 17.1 that at least age and kidslt6 have a strong effect on labor force participation. 
Table 17.7 contains the results from OLS and Heckit. [The standard errors reported for the Heckit 
results are just the usual OLS standard errors from regression (17.50).] There is no evidence of a 
sample selection problem in estimating the wage offer equation. The coefficient on $\hat{\lambda}$ has a very small 
$t$ statistic (.239), so we fail to reject $H_0$: $\rho = 0$. Just as importantly, there are no practically large 
differences in the estimated slope coefficients in Table 17.7. The estimated returns to education differ by 
only one-tenth of a percentage point. 


TABLE 17.7 Wage Offer Equation for Married Women 

Dependent Variable: log(wage) 

Independent Variables     OLS          Heckit 
educ                      .108         .109 
                          (.014)       (.016) 
exper                     .042         .044 
                          (.012)       (.016) 
$exper^2$                 -.00081      -.00086 
                          (.00039)     (.00044) 
constant                  -.522        -.578 
                          (.199)       (.307) 
$\hat{\lambda}$           --           .032 
                                       (.134) 

Sample size               428          428 
R-squared                 .157         .157 


An alternative to the preceding two-step estimation method is full MLE. This is more compli- 
cated as it requires obtaining the joint distribution of y and s. It often makes sense to test for sample 
selection using the previous procedure; if there is no evidence of sample selection, there is no reason 
to continue. If we detect sample selection bias, we can either use the two-step estimates or estimate 
the regression and selection equations jointly by MLE. [See Wooldridge (2010, Chapter 19).] 

In Example 17.5, we know more than just whether a woman worked during the year: we know 
how many hours each woman worked. It turns out that we can use this information in an alternative 
sample selection procedure. In place of the inverse Mills ratio $\hat{\lambda}_i$, we use the Tobit residuals, say, $\hat{v}_i$, 
which are computed as $\hat{v}_i = y_i - x_i\hat{\beta}$ whenever $y_i > 0$. It can be shown that the regression in (17.50) 
with $\hat{v}_i$ in place of $\hat{\lambda}_i$ also produces consistent estimates of the $\beta_j$, and the standard $t$ statistic on $\hat{v}_i$ is a 
valid test for sample selection bias. This approach has the advantage of using more information, but it 
is less widely applicable. [See Wooldridge (2010, Chapter 19).] 

There are many more topics concerning sample selection. One worth mentioning is models with 
endogenous explanatory variables in addition to possible sample selection bias. Write a model with a 
single endogenous explanatory variable as 


$$y_1 = \alpha_1 y_2 + z_1\beta_1 + u_1, \qquad [17.51]$$


where $y_1$ is only observed when $s = 1$ and $y_2$ may only be observed along with $y_1$. An example is 
when $y_1$ is the percentage of votes received by an incumbent, and $y_2$ is the percentage of total 
expenditures accounted for by the incumbent. For incumbents who do not run, we cannot observe $y_1$ or $y_2$. 
If we have exogenous factors that affect the decision to run and that are correlated with campaign 
expenditures, we can consistently estimate $\alpha_1$ and the elements of $\beta_1$ by instrumental variables. To be 
convincing, we need two exogenous variables that do not appear in (17.51). Effectively, one should 
affect the selection decision, and one should be correlated with $y_2$ [the usual requirement for 
estimating (17.51) by 2SLS]. Briefly, the method is to estimate the selection equation by probit, where all 
exogenous variables appear in the probit equation. Then, we add the inverse Mills ratio to (17.51) 
and estimate the equation by 2SLS. The inverse Mills ratio acts as its own instrument, as it depends 
only on exogenous variables. We use all exogenous variables as the other instruments. As before, we 
can use the $t$ statistic on $\hat{\lambda}_i$ as a test for selection bias. [See Wooldridge (2010, Chapter 19) for further 
information.] 


Summary 


In this chapter, we have covered several advanced methods that are often used in applications, especially in 
microeconomics. Logit and probit models are used for binary response variables. These models have some 
advantages over the linear probability model: fitted probabilities are between zero and one, and the partial 
effects diminish. The primary cost to logit and probit is that they are harder to interpret. 
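
To make the two advantages concrete, the following is a minimal sketch (not part of the text's examples) that evaluates the logit and probit response functions and the partial effect of a continuous regressor at several values of the index $\mathbf{x}\boldsymbol{\beta}$. The coefficient value and the index values are hypothetical, chosen only for illustration.

```python
import math


def logistic_cdf(z):
    """Logit response function exp(z) / [1 + exp(z)], computed stably."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def normal_cdf(z):
    """Standard normal cdf, the probit response function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def logit_partial_effect(index, beta_j):
    """Partial effect of a continuous x_j in logit: g(index) * beta_j."""
    p = logistic_cdf(index)
    return p * (1.0 - p) * beta_j


def probit_partial_effect(index, beta_j):
    """Partial effect of a continuous x_j in probit: phi(index) * beta_j."""
    return normal_pdf(index) * beta_j


if __name__ == "__main__":
    beta_j = 0.5  # hypothetical coefficient, for illustration only
    for index in (0.0, 1.0, 2.0, 4.0):
        print(
            index,
            round(logistic_cdf(index), 4),
            round(normal_cdf(index), 4),
            round(logit_partial_effect(index, beta_j), 4),
            round(probit_partial_effect(index, beta_j), 4),
        )
```

The fitted probabilities stay strictly inside the unit interval, and the partial effects shrink as the index moves away from zero, whereas a linear probability model would imply the same partial effect everywhere.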

The Tobit model is applicable to nonnegative outcomes that pile up at zero but also take on a broad 
range of positive values. Many individual choice variables, such as labor supply, amount of life insurance, 
and amount of pension fund invested in stocks, have this feature. As with logit and probit, the expected 
values of y given $\mathbf{x}$ (either conditional on y > 0 or unconditionally) depend on $\mathbf{x}$ and $\boldsymbol{\beta}$ in nonlinear ways. 
We gave the expressions for these expectations as well as formulas for the partial effects of each $x_j$ on the 
expectations. These can be estimated after the Tobit model has been estimated by maximum likelihood. 
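
As a rough illustration of these expressions, the sketch below computes $E(y|y>0,\mathbf{x}) = \mathbf{x}\boldsymbol{\beta} + \sigma\lambda(\mathbf{x}\boldsymbol{\beta}/\sigma)$ and $E(y|\mathbf{x}) = \Phi(\mathbf{x}\boldsymbol{\beta}/\sigma)\mathbf{x}\boldsymbol{\beta} + \sigma\phi(\mathbf{x}\boldsymbol{\beta}/\sigma)$, together with the corresponding partial effects for a continuous $x_j$. The function names are our own, and in practice the inputs would be replaced by Tobit MLE estimates.

```python
import math


def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def inverse_mills(c):
    """Inverse Mills ratio: lambda(c) = phi(c) / Phi(c)."""
    return normal_pdf(c) / normal_cdf(c)


def tobit_mean_positive(xb, sigma):
    """E(y | y > 0, x) = xb + sigma * lambda(xb / sigma)."""
    return xb + sigma * inverse_mills(xb / sigma)


def tobit_mean(xb, sigma):
    """E(y | x) = Phi(xb / sigma) * xb + sigma * phi(xb / sigma)."""
    c = xb / sigma
    return normal_cdf(c) * xb + sigma * normal_pdf(c)


def partial_effect_unconditional(xb, sigma, beta_j):
    """dE(y | x)/dx_j = beta_j * Phi(xb / sigma)."""
    return beta_j * normal_cdf(xb / sigma)


def partial_effect_conditional(xb, sigma, beta_j):
    """dE(y | y > 0, x)/dx_j = beta_j * {1 - lambda(c)[c + lambda(c)]}."""
    c = xb / sigma
    lam = inverse_mills(c)
    return beta_j * (1.0 - lam * (c + lam))


if __name__ == "__main__":
    # Hypothetical index, scale, and coefficient values.
    xb, sigma, beta_j = 1.5, 2.0, 0.8
    print(tobit_mean_positive(xb, sigma), tobit_mean(xb, sigma))
    print(partial_effect_conditional(xb, sigma, beta_j),
          partial_effect_unconditional(xb, sigma, beta_j))
```

Both scale factors multiplying $\beta_j$ lie between zero and one, which is why Tobit coefficients are not directly comparable to OLS coefficients.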

When the dependent variable is a count variable (that is, it takes on nonnegative integer values), a 
Poisson regression model is appropriate. The expected value of y given the $x_j$ has an exponential form. This 
gives the parameter interpretations as semi-elasticities or elasticities, depending on whether $x_j$ is in level or 
logarithmic form. In short, we can interpret the parameters as if they are in a linear model with log(y) as 
the dependent variable. The parameters can be estimated by MLE. However, because the Poisson distribution 
imposes equality of the variance and mean, it is often necessary to compute standard errors and test 
statistics that allow for over- or underdispersion. These are simple adjustments to the usual MLE standard 
errors and statistics. 
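
The adjustment can be sketched as follows, assuming we already have the Poisson fitted values $\hat{y}_i$ and the usual MLE standard errors. The dispersion estimate is $\hat{\sigma}^2 = (n - k - 1)^{-1}\sum_{i=1}^{n}\hat{u}_i^2/\hat{y}_i$, and the adjusted standard errors are the MLE standard errors multiplied by $\hat{\sigma}$. The function names and the small data set in the usage lines are hypothetical.

```python
import math


def dispersion_estimate(y, y_hat, k):
    """sigma^2-hat = (n - k - 1)^(-1) * sum of u_hat_i^2 / y_hat_i.

    y      : observed counts
    y_hat  : fitted values exp(x_i * beta_hat) from the Poisson regression
    k      : number of explanatory variables (excluding the intercept)
    """
    n = len(y)
    weighted_ssr = sum((yi - fi) ** 2 / fi for yi, fi in zip(y, y_hat))
    return weighted_ssr / (n - k - 1)


def adjust_standard_errors(mle_standard_errors, sigma2_hat):
    """Multiply the usual Poisson MLE standard errors by sigma-hat."""
    sigma_hat = math.sqrt(sigma2_hat)
    return [se * sigma_hat for se in mle_standard_errors]


if __name__ == "__main__":
    # Made-up numbers, only to show the mechanics.
    y = [0, 2, 1, 5]
    y_hat = [1.0, 1.0, 2.0, 2.0]
    sigma2_hat = dispersion_estimate(y, y_hat, k=1)
    print(sigma2_hat, adjust_standard_errors([0.10, 0.25], sigma2_hat))
```

A value of $\hat{\sigma}^2$ above one indicates overdispersion and inflates the standard errors; a value below one indicates underdispersion and shrinks them.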

Censored and truncated regression models handle specific kinds of missing data problems. In cen- 
sored regression, the dependent variable is censored above or below a threshold. We can use information on 
the censored outcomes because we always observe the explanatory variables, as in duration applications or 
top coding of observations. A truncated regression model arises when a part of the population is excluded 
entirely: we observe no information on units that are not covered by the sampling scheme. This is a special 
case of a sample selection problem. 

Section 17-5 gave a systematic treatment of nonrandom sample selection. We showed that exogenous 
sample selection does not affect consistency of OLS when it is applied to the subsample, but endogenous 
sample selection does. We showed how to test and correct for sample selection bias for the general problem 
of incidental truncation, where observations are missing on y due to the outcome of another variable (such 
as labor force participation). Heckman’s method is relatively easy to implement in these situations. 
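
For reference, the structure behind Heckman's method can be summarized compactly. The notation below follows the usual incidental truncation setup, with $\mathbf{z}$ containing all exogenous variables and $\mathbf{x}$ a subset of $\mathbf{z}$; it is a restatement for convenience, not a new result.

```latex
\begin{align*}
y &= \mathbf{x}\boldsymbol{\beta} + u, \qquad E(u|\mathbf{x}) = 0 \\
s &= 1[\mathbf{z}\boldsymbol{\gamma} + v \geq 0] \\
E(y|\mathbf{z}, s = 1) &= \mathbf{x}\boldsymbol{\beta} + \rho\,\lambda(\mathbf{z}\boldsymbol{\gamma}),
\qquad \lambda(c) = \phi(c)/\Phi(c)
\end{align*}
```

The first step estimates $\boldsymbol{\gamma}$ by probit of $s$ on $\mathbf{z}$ using the full sample; the second step regresses $y$ on $\mathbf{x}$ and $\hat{\lambda}_i = \lambda(\mathbf{z}_i\hat{\boldsymbol{\gamma}})$ using the selected sample. The t statistic on $\hat{\lambda}_i$ tests $H_0\colon \rho = 0$, that is, no sample selection bias.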


Key Terms 


Average Marginal Effect (AME)
Average Partial Effect (APE)
Binary Response Models
Censored Normal Regression Model
Censored Regression Model
Corner Solution Response
Count Variable
Duration Analysis
Exogenous Sample Selection
Heckit Method
Incidental Truncation
Inverse Mills Ratio
Latent Variable Model
Likelihood Ratio Statistic
Limited Dependent Variable (LDV)
Logit Model
Log-Likelihood Function
Maximum Likelihood Estimation (MLE)
Nonrandom Sample Selection
Overdispersion
Partial Effect at the Average (PEA)
Percent Correctly Predicted
Poisson Distribution
Poisson Regression Model
Probit Model
Pseudo R-Squared
Quasi-Likelihood Ratio Statistic
Quasi-Maximum Likelihood Estimation (QMLE)
Response Probability
Selected Sample
Tobit Model
Top Coding
Truncated Normal Regression Model
Truncated Regression Model
Underdispersion
Wald Statistic


Problems 


1 (i) For a binary response y, let $\bar{y}$ be the proportion of ones in the sample (which is equal to the 
sample average of the $y_i$). Let $\hat{q}_0$ be the percent correctly predicted for the outcome y = 0 and let 
$\hat{q}_1$ be the percent correctly predicted for the outcome y = 1. If $\hat{p}$ is the overall percent correctly 
predicted, show that $\hat{p}$ is a weighted average of $\hat{q}_0$ and $\hat{q}_1$: 

$$\hat{p} = (1 - \bar{y})\hat{q}_0 + \bar{y}\hat{q}_1.$$

(ii) In a sample of 300, suppose that $\bar{y} = .70$, so that there are 210 outcomes with $y_i = 1$ and 90 
with $y_i = 0$. Suppose that the percent correctly predicted when y = 0 is 80, and the percent 
correctly predicted when y = 1 is 40. Find the overall percent correctly predicted. 


2 Let grad be a dummy variable for whether a student-athlete at a large university graduates in five 
years. Let hsGPA and SAT be high school grade point average and SAT score, respectively. Let 
study be the number of hours spent per week in an organized study hall. Suppose that, using data on 
420 student-athletes, the following logit model is obtained: 

$$P(\mathit{grad} = 1|\mathit{hsGPA}, \mathit{SAT}, \mathit{study}) = \Lambda(-1.17 + .24\,\mathit{hsGPA} + .00058\,\mathit{SAT} + .073\,\mathit{study}),$$

where $\Lambda(z) = \exp(z)/[1 + \exp(z)]$ is the logit function. Holding hsGPA fixed at 3.0 and SAT fixed 
at 1,200, compute the estimated difference in the graduation probability for someone who spent 
10 hours per week in study hall and someone who spent 5 hours per week. 


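3 (Requires calculus) 
(i) Suppose in the Tobit model that $x_1 = \log(z_1)$, and this is the only place $z_1$ appears in $\mathbf{x}$. 
Show that 

$$\frac{\partial E(y|y > 0, \mathbf{x})}{\partial z_1} = (\beta_1/z_1)\{1 - \lambda(\mathbf{x}\boldsymbol{\beta}/\sigma)[\mathbf{x}\boldsymbol{\beta}/\sigma + \lambda(\mathbf{x}\boldsymbol{\beta}/\sigma)]\}, \qquad [17.52]$$

where $\beta_1$ is the coefficient on $\log(z_1)$. 
(ii) If $x_1 = z_1$ and $x_2 = z_1^2$, show that 

$$\frac{\partial E(y|y > 0, \mathbf{x})}{\partial z_1} = (\beta_1 + 2\beta_2 z_1)\{1 - \lambda(\mathbf{x}\boldsymbol{\beta}/\sigma)[\mathbf{x}\boldsymbol{\beta}/\sigma + \lambda(\mathbf{x}\boldsymbol{\beta}/\sigma)]\},$$

where $\beta_1$ is the coefficient on $z_1$ and $\beta_2$ is the coefficient on $z_1^2$. 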


4 Let $\mathit{mvp}_i$ be the marginal value product for worker i, which is the price of a firm’s good multiplied by 
the marginal product of the worker. Assume that 

$$\log(\mathit{mvp}_i) = \beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik} + u_i$$
$$\mathit{wage}_i = \max(\mathit{mvp}_i, \mathit{minwage}_i),$$

where the explanatory variables include education, experience, and so on, and $\mathit{minwage}_i$ is the 
minimum wage relevant for person i. Write $\log(\mathit{wage}_i)$ in terms of $\log(\mathit{mvp}_i)$ and $\log(\mathit{minwage}_i)$. 


5 (Requires calculus) Let patents be the number of patents applied for by a firm during a given year. 
Assume that the conditional expectation of patents given sales and RD is 

$$E(\mathit{patents}|\mathit{sales}, \mathit{RD}) = \exp[\beta_0 + \beta_1\log(\mathit{sales}) + \beta_2\mathit{RD} + \beta_3\mathit{RD}^2],$$

where sales is annual firm sales and RD is total spending on research and development over the past 
10 years. 

(i) How would you estimate the $\beta_j$? Justify your answer by discussing the nature of patents. 

(ii) How do you interpret $\beta_1$? 

(iii) Find the partial effect of RD on E(patents|sales,RD). 


6 Consider a family saving function for the population of all families in the United States: 

$$\mathit{sav} = \beta_0 + \beta_1\mathit{inc} + \beta_2\mathit{hhsize} + \beta_3\mathit{educ} + \beta_4\mathit{age} + u,$$

where hhsize is household size, educ is years of education of the household head, and age is age of the 
household head. Assume that E(u|inc,hhsize,educ,age) = 0. 

(i) Suppose that the sample includes only families whose head is over 25 years old. If we use OLS 
on such a sample, do we get unbiased estimators of the $\beta_j$? Explain. 

(ii) Now, suppose our sample includes only married couples without children. Can we estimate all 
of the parameters in the saving equation? Which ones can we estimate? 

(iii) Suppose we exclude from our sample families that save more than $25,000 per year. Does OLS 
produce consistent estimators of the $\beta_j$? 


7 Suppose you are hired by a university to study the factors that determine whether students admitted to 
the university actually come to the university. You are given a large random sample of students who 
were admitted the previous year. You have information on whether each student chose to attend, high 
school performance, family income, financial aid offered, race, and geographic variables. Someone 
says to you, “Any analysis of that data will lead to biased results because it is not a random sample 
of all college applicants, but only those who apply to this university.” What do you think of this 
criticism? 


8 Consider the program evaluation problem where w is the binary treatment indicator and the potential 
outcomes are y(0) and y(1), as we discussed in Sections 3-7e, 4-7, 7-6a, and elsewhere. For a set of 
control variables $x_1, x_2, \dots, x_k$, define conditional mean functions for the two counterfactuals: 

$$m_0(\mathbf{x}) = E[y(0)|\mathbf{x}]$$
$$m_1(\mathbf{x}) = E[y(1)|\mathbf{x}].$$

If y(0) and y(1) are binary, these are the response probabilities. If we wish to move beyond a linear 
probability model, we would likely use a logit or probit model. If the outcome is a count variable, we 
would likely use exponential models. We can also use the means from a Tobit if y(0), y(1) are corner 
solutions. 
Recall that the average treatment effect is 

$$\tau_{ate} = \mu_1 - \mu_0 = E[y(1)] - E[y(0)].$$

(i) Explain why 

$$\tau_{ate} = E[m_1(\mathbf{x})] - E[m_0(\mathbf{x})],$$

where both expectations are necessarily over the distribution of $\mathbf{x}$. [Hint: Use the law of iterated 
expectations described in Math Refresher B.] 
(ii) If you knew the functions $m_0(\mathbf{x})$ and $m_1(\mathbf{x})$, and you have a random sample 
$\{\mathbf{x}_i : i = 1, \dots, n\}$, explain why 

$$\tilde{\tau}_{ate} = n^{-1}\sum_{i=1}^{n}[m_1(\mathbf{x}_i) - m_0(\mathbf{x}_i)]$$

would be an unbiased estimator of $\tau_{ate}$. 
(iii) Now consider the case where we have models $m_0(\mathbf{x}, \boldsymbol{\theta}_0)$ and $m_1(\mathbf{x}, \boldsymbol{\theta}_1)$, where $\boldsymbol{\theta}_0$ and $\boldsymbol{\theta}_1$ are 
parameters and we have estimators $\hat{\boldsymbol{\theta}}_0$ and $\hat{\boldsymbol{\theta}}_1$. Now how would you estimate $\tau_{ate}$? 
(iv) Define the observed response as usual: 

$$y = (1 - w)y(0) + wy(1).$$

Show that if w is independent of [y(0), y(1)] conditional on $\mathbf{x}$, then 

$$E(y|w, \mathbf{x}) = (1 - w)m_0(\mathbf{x}) + wm_1(\mathbf{x}).$$

Therefore, 

$$E(y|w = 0, \mathbf{x}) = m_0(\mathbf{x})$$
$$E(y|w = 1, \mathbf{x}) = m_1(\mathbf{x}).$$

(v) If y(0) and y(1) follow logit or probit models with response probabilities 

$$G(\alpha_0 + \mathbf{x}\boldsymbol{\beta}_0)$$
$$G(\alpha_1 + \mathbf{x}\boldsymbol{\beta}_1),$$

respectively, how would you estimate all of the parameters? How would you estimate $\tau_{ate}$? 


Computer Exercises 


C1 Use the data in PNTSPRD for this exercise. 
(i) The variable favwin is a binary variable equal to one if the team favored by the Las Vegas point 
spread wins. A linear probability model to estimate the probability that the favored team wins is 

$$P(\mathit{favwin} = 1|\mathit{spread}) = \beta_0 + \beta_1\mathit{spread}.$$

Explain why, if the spread incorporates all relevant information, we expect $\beta_0 = .5$. 

(ii) Estimate the model from part (i) by OLS. Test $H_0\colon \beta_0 = .5$ against a two-sided alternative. Use 
both the usual and heteroskedasticity-robust standard errors. 

(iii) Is spread statistically significant? What is the estimated probability that the favored team wins 
when spread = 10? 

(iv) Now, estimate a probit model for P(favwin = 1|spread). Interpret and test the null hypothesis 
that the intercept is zero. [Hint: Remember that $\Phi(0) = .5$.] 

(v) Use the probit model to estimate the probability that the favored team wins when spread = 10. 
Compare this with the LPM estimate from part (iii). 

(vi) Add the variables favhome, fav25, and und25 to the probit model and test joint significance of 
these variables using the likelihood ratio test. (How many df are in the chi-square distribution?) 
Interpret this result, focusing on the question of whether the spread incorporates all observable 
information prior to a game. 


C2 Use the data in LOANAPP for this exercise; see also Computer Exercise C8 in Chapter 7. 

(i) Estimate a probit model of approve on white. Find the estimated probability of loan approval for 
both whites and nonwhites. How do these compare with the linear probability estimates? 

(ii) Now, add the variables hrat, obrat, loanprc, unem, male, married, dep, sch, cosign, chist, 
pubrec, mortlat1, mortlat2, and vr to the probit model. Is there statistically significant evidence 
of discrimination against nonwhites? 

(iii) Estimate the model from part (ii) by logit. Compare the coefficient on white to the probit 
estimate. 

(iv) Use equation (17.17) to estimate the sizes of the discrimination effects for probit and logit. 


C3 Use the data in FRINGE for this exercise. 
(i) For what percentage of the workers in the sample is pension equal to zero? What is the range 
of pension for workers with nonzero pension benefits? Why is a Tobit model appropriate for 
modeling pension? 

(ii) Estimate a Tobit model explaining pension in terms of exper, age, tenure, educ, depends, 
married, white, and male. Do whites and males have statistically significantly higher expected 
pension benefits? 

(iii) Use the results from part (ii) to estimate the difference in expected pension benefits for a white 
male and a nonwhite female, both of whom are 35 years old, are single with no dependents, 
have 16 years of education, and have 10 years of experience. 

(iv) Add union to the Tobit model and comment on its significance. 

(v) Apply the Tobit model from part (iv) but with peratio, the pension-earnings ratio, as the 
dependent variable. (Notice that this is a fraction between zero and one, but, though it often 
takes on the value zero, it never gets close to being unity. Thus, a Tobit model is fine as an 
approximation.) Does gender or race have an effect on the pension-earnings ratio? 


C4 In Example 9.1, we added the quadratic terms $\mathit{pcnv}^2$, $\mathit{ptime86}^2$, and $\mathit{inc86}^2$ to a linear model for narr86. 

(i) Use the data in CRIME1 to add these same terms to the Poisson regression in Example 17.3. 

(ii) Compute the estimate of $\sigma^2$ given by $\hat{\sigma}^2 = (n - k - 1)^{-1}\sum_{i=1}^{n}\hat{u}_i^2/\hat{y}_i$. Is there evidence of 
overdispersion? How should the Poisson MLE standard errors be adjusted? 

(iii) Use the results from parts (i) and (ii) and Table 17.5 to compute the quasi-likelihood ratio 
statistic for joint significance of the three quadratic terms. What do you conclude? 


C5 Refer to Table 13.1 in Chapter 13. There, we used the data in FERTIL1 to estimate a linear model for 
kids, the number of children ever born to a woman. 

(i) Estimate a Poisson regression model for kids, using the same variables in Table 13.1. Interpret 
the coefficient on y82. 

(ii) What is the estimated percentage difference in fertility between a black woman and a nonblack 
woman, holding other factors fixed? 

(iii) Obtain $\hat{\sigma}$. Is there evidence of over- or underdispersion? 

(iv) Compute the fitted values from the Poisson regression and obtain the R-squared as the squared 
correlation between $\mathit{kids}_i$ and $\widehat{\mathit{kids}}_i$. Compare this with the R-squared for the linear regression 
model. 


C6 Use the data in RECID to estimate the model from Example 17.4 by OLS, using only the 552 
uncensored durations. Comment generally on how these estimates compare with those in Table 17.6. 


C7 Use the MROZ data for this exercise. 

(i) Using the 428 women who were in the workforce, estimate the return to education by OLS 
including exper, $\mathit{exper}^2$, nwifeinc, age, kidslt6, and kidsge6 as explanatory variables. Report 
your estimate on educ and its standard error. 

(ii) Now, estimate the return to education by Heckit, where all exogenous variables show up in the 
second-stage regression. In other words, the regression is log(wage) on educ, exper, $\mathit{exper}^2$, 
nwifeinc, age, kidslt6, kidsge6, and $\hat{\lambda}$. Compare the estimated return to education and its 
standard error to that from part (i). 

(iii) Using only the 428 observations for working women, regress $\hat{\lambda}$ on educ, exper, $\mathit{exper}^2$, nwifeinc, 
age, kidslt6, and kidsge6. How big is the R-squared? How does this help explain your findings 
from part (ii)? (Hint: Think multicollinearity.) 


C8 The file JTRAIN2 contains data on a job training experiment for a group of men. Men could enter the 
program starting in January 1976 through about mid-1977. The program ended in December 1977. The 
idea is to test whether participation in the job training program had an effect on unemployment 
probabilities and earnings in 1978. 


(i) The variable train is the job training indicator. How many men in the sample participated in the 
job training program? What was the highest number of months a man actually participated in 
the program? 

(ii) Run a linear regression of train on several demographic and pretraining variables: unem74, 
unem75, age, educ, black, hisp, and married. Are these variables jointly significant at the 5% level? 

(iii) Estimate a probit version of the linear model in part (ii). Compute the likelihood ratio test for 
joint significance of all variables. What do you conclude? 

(iv) Based on your answers to parts (ii) and (iii), does it appear that participation in job training can 
be treated as exogenous for explaining 1978 unemployment status? Explain. 

(v) Run a simple regression of unem78 on train and report the results in equation form. What is 
the estimated effect of participating in the job training program on the probability of being 
unemployed in 1978? Is it statistically significant? 

(vi) Run a probit of unem78 on train. Does it make sense to compare the probit coefficient on train 
with the coefficient obtained from the linear model in part (v)? 

(vii) Find the fitted probabilities from parts (v) and (vi). Explain why they are identical. Which 
approach would you use to measure the effect and statistical significance of the job training 
program? 

(viii) Add all of the variables from part (ii) as additional controls to the models from parts (v) 
and (vi). Are the fitted probabilities now identical? What is the correlation between them? 

(ix) Using the model from part (viii), estimate the average partial effect of train on the 1978 
unemployment probability. Use (17.17) with $c_k = 0$. How does the estimate compare with the 
OLS estimate from part (viii)? 


C9 Use the data in APPLE for this exercise. These are telephone survey data attempting to elicit the 
demand for a (fictional) “ecologically friendly” apple. Each family was (randomly) presented with a 
set of prices for regular apples and the ecolabeled apples. They were asked how many pounds of each 
kind of apple they would buy. 

(i) Of the 660 families in the sample, how many report wanting none of the ecolabeled apples at 
the set price? 

(ii) Does the variable ecolbs seem to have a continuous distribution over strictly positive values? 
What implications does your answer have for the suitability of a Tobit model for ecolbs? 

(iii) Estimate a Tobit model for ecolbs with ecoprc, regprc, faminc, and hhsize as explanatory 
variables. Which variables are significant at the 1% level? 

(iv) Are faminc and hhsize jointly significant? 

(v) Are the signs of the coefficients on the price variables from part (iii) what you expect? Explain. 

(vi) Let $\beta_1$ be the coefficient on ecoprc and let $\beta_2$ be the coefficient on regprc. Test the hypothesis 
$H_0\colon -\beta_1 = \beta_2$ against the two-sided alternative. Report the p-value of the test. (You might want 
to refer to Section 4-4 if your regression package does not easily compute such tests.) 

(vii) Obtain the estimates of E(ecolbs|x) for all observations in the sample. [See equation (17.25).] 
Call these $\widehat{\mathit{ecolbs}}_i$. What are the smallest and largest fitted values? 

(viii) Compute the squared correlation between $\mathit{ecolbs}_i$ and $\widehat{\mathit{ecolbs}}_i$. 

(ix) Now, estimate a linear model for ecolbs using the same explanatory variables from part (iii). 
Why are the OLS estimates so much smaller than the Tobit estimates? In terms of goodness-of-fit, 
is the Tobit model better than the linear model? 

(x) Evaluate the following statement: “Because the R-squared from the Tobit model is so small, the 
estimated price effects are probably inconsistent.” 


C10 Use the data in SMOKE for this exercise. 

(i) The variable cigs is the number of cigarettes smoked per day. How many people in the sample 
do not smoke at all? What fraction of people claim to smoke 20 cigarettes a day? Why do you 
think there is a pileup of people at 20 cigarettes? 

(ii) Given your answers to part (i), does cigs seem a good candidate for having a conditional 
Poisson distribution? 

(iii) Estimate a Poisson regression model for cigs, including log(cigpric), log(income), white, educ, 
age, and $\mathit{age}^2$ as explanatory variables. What are the estimated price and income elasticities? 

(iv) Using the maximum likelihood standard errors, are the price and income variables statistically 
significant at the 5% level? 

(v) Obtain the estimate of $\sigma^2$ described after equation (17.35). What is $\hat{\sigma}$? How should you adjust 
the standard errors from part (iv)? 

(vi) Using the adjusted standard errors from part (v), are the price and income elasticities now 
statistically different from zero? Explain. 

(vii) Are the education and age variables significant using the more robust standard errors? How do 
you interpret the coefficient on educ? 

(viii) Obtain the fitted values, $\hat{y}_i$, from the Poisson regression model. Find the minimum and 
maximum values and discuss how well the exponential model predicts heavy cigarette 
smoking. 

(ix) Using the fitted values from part (viii), obtain the squared correlation coefficient between $\hat{y}_i$ 
and $y_i$. 

(x) Estimate a linear model for cigs by OLS, using the explanatory variables (and same functional 
forms) as in part (iii). Does the linear model or exponential model provide a better fit? Is either 
R-squared very large? 


C11 Use the data in CPS91 for this exercise. These data are for married women, where we also have 
information on each husband’s income and demographics. 

(i) What fraction of the women report being in the labor force? 

(ii) Using only the data for working women (you have no choice), estimate the wage equation 

$$\log(\mathit{wage}) = \beta_0 + \beta_1\mathit{educ} + \beta_2\mathit{exper} + \beta_3\mathit{exper}^2 + \beta_4\mathit{black} + \beta_5\mathit{hispanic} + u$$

by ordinary least squares. Report the results in the usual form. Do there appear to be significant 
wage differences by race and ethnicity? 

(iii) Estimate a probit model for inlf that includes the explanatory variables in the wage equation 
from part (ii) as well as nwifeinc and kidlt6. Do these last two variables have coefficients of the 
expected sign? Are they statistically significant? 

(iv) Explain why, for the purposes of testing and, possibly, correcting the wage equation for 
selection into the workforce, it is important for nwifeinc and kidlt6 to help explain inlf. What 
must you assume about nwifeinc and kidlt6 in the wage equation? 

(v) Compute the inverse Mills ratio (for each observation) and add it as an additional regressor to 
the wage equation from part (ii). What is its two-sided p-value? Do you think this is particularly 
small with 3,286 observations? 

(vi) Does adding the inverse Mills ratio change the coefficients in the wage regression in important 
ways? Explain. 


C12 Use the data in CHARITY to answer these questions. 

(i) The variable respond is a binary variable equal to one if an individual responded with a 
donation to the most recent request. The database consists only of people who have responded 
at least once in the past. What fraction of people responded most recently? 

(ii) Estimate a probit model for respond, using resplast, weekslast, propresp, mailsyear, and avggift 
as explanatory variables. Which of the explanatory variables is statistically significant? 

(iii) Find the average partial effect for mailsyear and compare it with the coefficient from a linear 
probability model. 

(iv) Using the same explanatory variables, estimate a Tobit model for gift, the amount of the most 
recent gift (in Dutch guilders). Now which explanatory variable is statistically significant? 

(v) Compare the Tobit APE for mailsyear with that from a linear regression. Are they similar? 

(vi) Are the estimates from parts (ii) and (iv) entirely compatible with a Tobit model? Explain. 


C13 Use the data in HTV to answer these questions. 

(i) Using OLS on the full sample, estimate a model for log(wage) using explanatory variables educ, 
abil, exper, nc, west, south, and urban. Report the estimated return to education and its standard 
error. 

(ii) Now estimate the equation from part (i) using only people with educ < 16. What percentage 
of the sample is lost? Now what is the estimated return to a year of schooling? How does it 
compare with part (i)? 

(iii) Now drop all observations with wage ≥ 20, so that everyone remaining in the sample earns 
less than $20 an hour. Run the regression from part (i) and comment on the coefficient on educ. 
(Because the normal truncated regression model assumes that y is continuous, it does not matter in 
theory whether we drop observations with wage ≥ 20 or wage > 20. In practice, including in this 
application, it can matter slightly because there are some people who earn exactly $20 per hour.) 

(iv) Using the sample in part (iii), apply truncated regression [with the upper truncation point 
being log(20)]. Does truncated regression appear to recover the return to education in the full 
population, assuming the estimate from (i) is consistent? Explain. 


C14 Use the data in HAPPINESS for this question. See also Computer Exercise C15 in Chapter 13. 

(i) Estimate a probit probability model relating vhappy to occattend and regattend, and include a 
full set of year dummies. Find the average partial effects for occattend and regattend. How do 
these compare with those from estimating a linear probability model? 

(ii) Define a variable, highinc, equal to one if family income is above $25,000. Include highinc, 
unem10, educ, and teens to the probit estimation in part (ii). Is the APE of regattend affected 
much? What about its statistical significance? 

(iii) Discuss the APEs and statistical significance of the four new variables in part (ii). Do the 
estimates make sense? 

(iv) Controlling for the factors in part (ii), do there appear to be differences in happiness by gender 
or race? Justify your answer. 


C15 Use the data set in ALCOHOL, obtained from Terza (2002), to answer this question. The data, on 
9,822 men, includes labor market information, whether the man abuses alcohol, and demographic and 
background variables. In this question you will study the effects of alcohol abuse on employ, which is 
a binary variable equal to one if the man has a job. If employ = 0, the man is either unemployed or not 
in the workforce. 

(i) What fraction of the sample is employed at the time of the interview? What fraction of the 
sample has abused alcohol? 

(ii) Run the simple regression of employ on abuse and report the results in the usual form, 
obtaining the heteroskedasticity-robust standard errors. Interpret the estimated equation. Is the 
relationship as you expected? Is it statistically significant? 

(iii) Run a probit of employ on abuse. Do you get the same sign and statistical significance as in 
part (ii)? How does the average partial effect for the probit compare with that for the linear 
probability model? 

(iv) Obtain the fitted values for the LPM estimated in part (ii) and report what they are when 
abuse = 0 and when abuse = 1. How do these compare to the probit fitted values, and why? 

(v) To the LPM in part (ii) add the variables age, agesq, educ, educsq, married, famsize, white, 
northeast, midwest, south, centcity, outercity, qrt1, qrt2, and qrt3. What happens to the 
coefficient on abuse and its statistical significance? 

(vi) Estimate a probit model using the variables in part (v). Find the APE of abuse and its t statistic. 
Is the estimated effect now identical to that for the linear model? Is it “close”? 

(vii) Variables indicating the overall health of each man are also included in the data set. Is it obvious 
that such variables should be included as controls? Explain. 

(viii) Why might abuse be properly thought of as endogenous in the employ equation? Do you think 
the variables mothalc and fathalc, indicating whether a man’s mother or father were alcoholics, 
are sensible instrumental variables for abuse? 

(ix) Estimate the LPM underlying part (v) by 2SLS, where mothalc and fathalc act as IVs for abuse. 
Is the difference between the 2SLS and OLS coefficients practically large? 

(x) Use the test described in Section 15-5 to test whether abuse is endogenous in the LPM. 


C16 Use the data in CRIME1 to answer this question. 

(i) For the OLS estimates reported in Table 17.5, find the heteroskedasticity-robust standard errors. 
In terms of statistical significance of the coefficients, are there any notable changes? 

(ii) Obtain the fully robust standard errors (that is, those that do not even require assumption 
(17.35)) for the Poisson regression estimates in the second column. (This requires that 
you have a statistical package that computes the fully robust standard errors.) Compare the 
fully robust 95% confidence interval for $\beta_{pcnv}$ with that obtained using the standard error in 
Table 17.5. 

(iii) Compute the average partial effects for each variable in the Poisson regression model. Use the 
formula for binary explanatory variables for black, hispan, and born60. Compare the APEs for 
qemp86 and inc86 with the corresponding OLS coefficients. 

(iv) If your statistical package reports the robust standard errors for the APEs in part (iii), compare 
the robust t statistic for the OLS estimate of $\beta_{pcnv}$ with the robust t statistic for the APE of pcnv 
in the Poisson regression. 


C17 Use the data in JTRAIN98 to answer the following questions. See also Examples 3.7, 4.11, and 7.13 
for linear model analysis. Here you will use a Tobit model because the outcome, earn98, sometimes is 
zero. 

(i) How many observations (men) in the sample have earn98 = 0? Is it a large percentage of the 
sample? 

(ii) Estimate a Tobit model for earn98, using train, earn96, educ, and married as the explanatory 
variables. Report $\hat{\beta}_{train}$ and its standard error. Is the sign what you expect? How statistically 
significant is it? 

(iii) Does it make sense to compare the magnitude of the Tobit coefficient $\hat{\beta}_{train}$ with the OLS 
coefficient from running a linear regression, say $\hat{\gamma}_{train}$? Explain. 

(iv) In part (ii), obtain the average partial effect (which is the average treatment effect) of train, 
and obtain its standard error. (Many econometrics packages have built-in commands to do 
this calculation.) How does it compare with the OLS coefficient, $\hat{\gamma}_{train}$? What about statistical 
significance? 

(v) To the Tobit estimation, include a full set of interactions of train with earn96, educ, and 
married. Now compute the APE (ATE) of train, along with a standard error. How does it 
compare with the linear model with full interactions in Example 7.13? [Incidentally, for getting 
the ATE it does not help to demean earn96, educ, and married before creating the interactions 
because of the nonlinear conditional mean function. Nevertheless, it will make the coefficients 
more comparable to those in the Tobit model without the interactions.] 

(vi) Explain why the estimation in part (v) is not the same as estimating two separate Tobit models 
for the control and treatment groups and then obtaining the ATE as 

$$n^{-1}\sum_{i=1}^{n}[\hat{m}_1(\mathbf{x}_i) - \hat{m}_0(\mathbf{x}_i)],$$

where $\hat{m}_0(\cdot)$ is the estimated mean function using the control (no training) group, $\hat{m}_1(\cdot)$ is the 
estimated mean function using the treated (training) group, and $\mathbf{x}_i = (\mathit{earn96}_i, \mathit{educ}_i, \mathit{married}_i)$. 


APPENDIX 17A 
17A.1 Maximum Likelihood Estimation with Explanatory Variables 


Math Refresher C provides a review of maximum likelihood estimation (MLE) in the simplest case 
of estimating the parameters in an unconditional distribution. But most models in econometrics have 
explanatory variables, whether we estimate those models by OLS or MLE. The latter is indispens- 
able for nonlinear models, and here we provide a very brief description of the general approach. 

All of the models covered in this chapter can be put in the following form. Let $f(y \mid x, \beta)$ denote
the density function for a random draw $y_i$ from the population, conditional on $x_i = x$. The maximum
likelihood estimator (MLE) of $\beta$ maximizes the log-likelihood function,

$$\max_{b} \sum_{i=1}^{n} \log f(y_i \mid x_i, b),$$ [17.53]

where the vector $b$ is the dummy argument in the maximization problem. In most cases, the MLE,
which we write as $\hat{\beta}$, is consistent and has an approximate normal distribution in large samples. This
is true even though we cannot write down a formula for $\hat{\beta}$ except in very special circumstances.

For the binary response case (logit and probit), the conditional density is determined by two
values, $f(1 \mid x, \beta) = P(y_i = 1 \mid x_i) = G(x_i\beta)$ and $f(0 \mid x, \beta) = P(y_i = 0 \mid x_i) = 1 - G(x_i\beta)$. In fact, a
succinct way to write the density is $f(y \mid x, \beta) = [1 - G(x\beta)]^{(1-y)} [G(x\beta)]^{y}$ for $y = 0, 1$. Thus, we
can write (17.53) as

$$\max_{b} \sum_{i=1}^{n} \{(1 - y_i)\log[1 - G(x_i b)] + y_i \log[G(x_i b)]\}.$$ [17.54]

Generally, the solutions to (17.54) are quickly found by modern computers using iterative methods
to maximize a function. The total computation time even for fairly large data sets is typically
quite low.
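
As a rough illustration of the iterative maximization of (17.54), the following sketch applies Newton-Raphson to the logit case, where $G(z) = \exp(z)/[1 + \exp(z)]$. The function names, the simulated data, and the parameter values are assumptions made only for this example; econometrics packages use more carefully engineered routines.

```python
import numpy as np

def logistic(z):
    """Logit function G(z) = exp(z) / (1 + exp(z))."""
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(b, X, y):
    """Objective in (17.54): sum of (1 - y_i) log[1 - G(x_i b)] + y_i log[G(x_i b)]."""
    p = logistic(X @ b)
    return np.sum((1 - y) * np.log(1 - p) + y * np.log(p))

def logit_mle(X, y, tol=1e-10, max_iter=50):
    """Maximize the logit log-likelihood by Newton-Raphson, starting from b = 0."""
    b = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = logistic(X @ b)
        score = X.T @ (y - p)                            # gradient of the log-likelihood
        neg_hessian = (X * (p * (1 - p))[:, None]).T @ X  # minus the second-derivative matrix
        step = np.linalg.solve(neg_hessian, score)
        b = b + step
        if np.max(np.abs(step)) < tol:
            break
    return b

# Simulated data; the parameter values are hypothetical and chosen only for illustration.
rng = np.random.default_rng(0)
n = 5000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([-0.5, 1.0])
y = (rng.uniform(size=n) < logistic(X @ beta_true)).astype(float)

b_hat = logit_mle(X, y)
print("estimates:", b_hat)
print("log-likelihood at estimates:", log_likelihood(b_hat, X, y))
```

Because the logit log-likelihood is globally concave, the iterations usually settle after a handful of steps, which is consistent with the remark about low computation time.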
The log-likelihood functions for the Tobit model and for censored and truncated regression are
only slightly more complicated, depending on a variance parameter in addition to $\beta$.
They are easily derived from the densities obtained in the text. See Wooldridge (2010) for details.


APPENDIX 17B 
17B.1 Asymptotic Standard Errors in Limited Dependent Variable Models 


Derivations of the asymptotic standard errors for the models and methods introduced in this chapter 
are well beyond the scope of this text. Not only do the derivations require matrix algebra, but they 
also require advanced asymptotic theory of nonlinear estimation. The background needed for a care- 
ful analysis of these methods and several derivations are given in Wooldridge (2010). 

It is instructive to see the formulas for obtaining the asymptotic standard errors for at least some
of the methods. Given the binary response model $P(y = 1 \mid x) = G(x\beta)$, where $G(\cdot)$ is the logit
or probit function, and $\beta$ is the $k \times 1$ vector of parameters, the asymptotic variance matrix of $\hat{\beta}$ is
estimated as

$$\widehat{\mathrm{Avar}}(\hat{\beta}) = \left( \sum_{i=1}^{n} \frac{[g(x_i\hat{\beta})]^2 \, x_i' x_i}{G(x_i\hat{\beta})[1 - G(x_i\hat{\beta})]} \right)^{-1},$$ [17.55]

which is a $k \times k$ matrix. (See Advanced Treatment D for a summary of matrix algebra.) Without
the terms involving $g(\cdot)$ and $G(\cdot)$, this formula looks a lot like the estimated variance matrix for the
OLS estimator, minus the term $\hat{\sigma}^2$. The expression in (17.55) accounts for the nonlinear nature of the
response probability (that is, the nonlinear nature of $G(\cdot)$) as well as the particular form of
heteroskedasticity in a binary response model: $\mathrm{Var}(y \mid x) = G(x\beta)[1 - G(x\beta)]$.

The square roots of the diagonal elements of (17.55) are the asymptotic standard errors of the
$\hat{\beta}_j$, and they are routinely reported by econometrics software that supports logit and probit analysis.
Once we have these, (asymptotic) t statistics and confidence intervals are obtained in the usual ways.

The matrix in (17.55) is also the basis for Wald tests of multiple restrictions on $\beta$ [see
Wooldridge (2010, Chapter 15)].
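
The formula in (17.55) can be evaluated directly once estimates are available. The sketch below is only illustrative: the function names, the simulated regressors, and the "estimates" in `beta_hat` are hypothetical, and $g(\cdot)$ is taken to be the derivative of $G(\cdot)$. For the logit function, $g = G(1 - G)$, so the weight in (17.55) collapses to $G(1 - G)$.

```python
import numpy as np

def binary_response_avar(X, beta_hat, G, g):
    """Estimated asymptotic variance (17.55) for the model P(y = 1 | x) = G(x beta)."""
    index = X @ beta_hat
    weights = g(index) ** 2 / (G(index) * (1.0 - G(index)))
    information = (X * weights[:, None]).T @ X   # sum over i of w_i * x_i' x_i (k x k)
    return np.linalg.inv(information)

def logit_cdf(z):
    return 1.0 / (1.0 + np.exp(-z))

def logit_pdf(z):
    # Derivative of the logit function: g(z) = G(z) * [1 - G(z)].
    p = logit_cdf(z)
    return p * (1.0 - p)

rng = np.random.default_rng(1)
n = 1000
X = np.column_stack([np.ones(n), rng.normal(size=n), rng.normal(size=n)])
beta_hat = np.array([0.2, -0.7, 0.4])   # hypothetical estimates, not from any data set in the text

avar = binary_response_avar(X, beta_hat, logit_cdf, logit_pdf)
std_errors = np.sqrt(np.diag(avar))     # asymptotic standard errors of the beta_hat_j
print(std_errors)
```

Passing the standard normal cdf and pdf for `G` and `g` would give the probit version of the same calculation.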

The asymptotic variance matrix for Tobit is more complicated but has a similar structure. Note
that we can obtain a standard error for $\hat{\sigma}$ as well. The asymptotic variance for Poisson regression,
allowing for $\sigma^2 \neq 1$ in (17.35), has a form much like (17.55):

$$\widehat{\mathrm{Avar}}(\hat{\beta}) = \hat{\sigma}^2 \left( \sum_{i=1}^{n} \exp(x_i\hat{\beta}) \, x_i' x_i \right)^{-1}.$$ [17.56]

The square roots of the diagonal elements of this matrix are the asymptotic standard errors. If the
Poisson assumption holds, we can drop $\hat{\sigma}^2$ from the formula (because $\sigma^2 = 1$).

The formula for the fully robust variance matrix estimator is obtained in Wooldridge (2010,
Chapter 18):

$$\widehat{\mathrm{Avar}}(\hat{\beta}) = \left[ \sum_{i=1}^{n} \exp(x_i\hat{\beta}) \, x_i' x_i \right]^{-1} \left( \sum_{i=1}^{n} \hat{u}_i^2 \, x_i' x_i \right) \left[ \sum_{i=1}^{n} \exp(x_i\hat{\beta}) \, x_i' x_i \right]^{-1},$$

where $\hat{u}_i = y_i - \exp(x_i\hat{\beta})$ are the residuals from the Poisson regression. This expression has a
structure similar to the heteroskedasticity-robust variance matrix estimator for OLS, and it is
computed routinely by many software packages to obtain the fully robust standard errors.
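
To make the two Poisson variance formulas concrete, here is a minimal sketch that fits a Poisson regression by Newton-Raphson on simulated data and then computes (17.56) with $\sigma^2$ set to one alongside the fully robust sandwich form. All names, the data-generating values, and the starting rule are assumptions for illustration, not a description of any particular package.

```python
import numpy as np

def poisson_fit(X, y, tol=1e-10, max_iter=50):
    """Newton-Raphson for the Poisson (quasi-)MLE with mean exp(x beta).
    Assumes the first column of X is a constant."""
    beta = np.zeros(X.shape[1])
    beta[0] = np.log(y.mean())                   # start at the intercept-only fit
    for _ in range(max_iter):
        mu = np.exp(X @ beta)
        step = np.linalg.solve((X * mu[:, None]).T @ X, X.T @ (y - mu))
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

def poisson_variances(X, y, beta_hat):
    """Return (17.56) with sigma^2 = 1 and the fully robust sandwich estimator."""
    mu = np.exp(X @ beta_hat)
    resid = y - mu                               # u_hat_i = y_i - exp(x_i beta_hat)
    A_inv = np.linalg.inv((X * mu[:, None]).T @ X)
    B = (X * (resid ** 2)[:, None]).T @ X
    return A_inv, A_inv @ B @ A_inv

# Simulated data that truly are Poisson; parameter values are hypothetical.
rng = np.random.default_rng(2)
n = 5000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true = np.array([0.5, 0.3])
y = rng.poisson(np.exp(X @ beta_true)).astype(float)

beta_hat = poisson_fit(X, y)
ml_var, robust_var = poisson_variances(X, y, beta_hat)
se_ml = np.sqrt(np.diag(ml_var))
se_robust = np.sqrt(np.diag(robust_var))
print(beta_hat, se_ml, se_robust)
```

When the Poisson variance assumption holds, as in this simulation, the two sets of standard errors should be close; they can differ substantially under overdispersion or underdispersion.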

Asymptotic standard errors for censored regression, truncated regression, and the Heckit sample 
selection correction are more complicated, although they share features with the previous formulas. 
[See Wooldridge (2010) for details. ] 

In this chapter, we cover some more advanced topics in time series econometrics. In Chapters 10,
11, and 12, we emphasized in several places that using time series data in regression analysis
requires some care due to the trending, persistent nature of many economic time series. In addition
to studying topics such as infinite distributed lag models and forecasting, we also discuss some recent
advances in analyzing time series processes with unit roots.

In Section 18-1, we describe infinite distributed lag models, which allow a change in an explana- 
tory variable to affect all future values of the dependent variable. Conceptually, these models are 
straightforward extensions of the finite distributed lag models in Chapter 10, but estimating these 
models poses some interesting challenges. 

In Section 18-2, we show how to formally test for unit roots in a time series process. Recall from 
Chapter 11 that we excluded unit root processes to apply the usual asymptotic theory. Because the 
presence of a unit root implies that a shock today has a long-lasting impact, determining whether a 
process has a unit root is of interest in its own right. 

We cover the notion of spurious regression between two time series processes, each of which has
a unit root, in Section 18-3. The main result is that even if two unit root series are independent, it is
quite likely that the regression of one on the other will yield a statistically significant t statistic. This
emphasizes the potentially serious consequences of using standard inference when the dependent and
independent variables are integrated processes.

The notion of cointegration applies when two series are I(1), but a linear combination of them
is I(0); in this case, the regression of one on the other is not spurious, but instead tells us something
about the long-run relationship between them. Cointegration between two series also implies a
particular kind of model, called an error correction model, for the short-term dynamics. We cover
these models in Section 18-4.

In Section 18-5, we provide an overview of forecasting and bring together all of the tools in this
and previous chapters to show how regression methods can be used to forecast future outcomes of a
time series. The forecasting literature is vast, so we focus only on the most common regression-based
methods. We also touch on the related topic of Granger causality.


18-1 Infinite Distributed Lag Models 


Let $\{(y_t, z_t): t = \ldots, -2, -1, 0, 1, 2, \ldots\}$ be a bivariate time series process (which is only
partially observed). An infinite distributed lag (IDL) model relating $y_t$ to current and all past values
of $z$ is

$$y_t = \alpha + \delta_0 z_t + \delta_1 z_{t-1} + \delta_2 z_{t-2} + \cdots + u_t,$$ [18.1]


where the sum on lagged z extends back to the indefinite past. This model is only an approximation to 
reality, as no economic process started infinitely far into the past. Compared with a finite distributed 
lag model, an IDL model does not require that we truncate the lag at a particular value. 

In order for model (18.1) to make sense, the lag coefficients, $\delta_j$, must tend to zero as $j \to \infty$. This
is not to say that $\delta_2$ is smaller in magnitude than $\delta_1$; it only means that the impact of $z_{t-j}$ on $y_t$ must
eventually become small as $j$ gets large. In most applications, this makes economic sense as well: the
distant past of $z$ should be less important for explaining $y$ than the recent past of $z$.

Even if we decide that (18.1) is a useful model, we clearly cannot estimate it without some
restrictions. For one, we only observe a finite history of data. Equation (18.1) involves an infinite
number of parameters, $\delta_0, \delta_1, \delta_2, \ldots$, which cannot be estimated without restrictions. Later, we place
restrictions on the $\delta_j$ that allow us to estimate (18.1).

As with finite distributed lag (FDL) models, the impact propensity in (18.1) is simply $\delta_0$ (see
Chapter 10). Generally, the $\delta_h$ have the same interpretation as in an FDL. Suppose that $z_s = 0$ for all
$s < 0$ and that $z_0 = 1$ and $z_s = 0$ for all $s \geq 1$; in other words, at time $t = 0$, $z$ increases temporarily
by one unit and then reverts to its initial level of zero. For any $h \geq 0$, we have $y_h = \alpha + \delta_h + u_h$,
and so

$$E(y_h) = \alpha + \delta_h,$$ [18.2]

where we use the standard assumption that $u_h$ has zero mean. It follows that $\delta_h$ is the change in $E(y_h)$
given a one-unit, temporary change in $z$ at time zero. We just said that $\delta_h$ must be tending to zero as
$h$ gets large for the IDL to make sense. This means that a temporary change in $z$ has no long-run effect
on expected $y$: $E(y_h) = \alpha + \delta_h \to \alpha$ as $h \to \infty$.

We assumed that the process $z$ starts at $z_s = 0$ and that the one-unit increase occurred at $t = 0$.
These were only for the purpose of illustration. More generally, if $z$ temporarily increases by one unit
(from any initial level) at time $t$, then $\delta_h$ measures the change in the expected value of $y$ after $h$
periods. The lag distribution, which is $\delta_h$ plotted as a function of $h$, shows the expected path that future
outcomes on $y$ follow given the one-unit, temporary increase in $z$.

The long-run propensity in model (18.1) is the sum of all of the lag coefficients: 


$$LRP = \delta_0 + \delta_1 + \delta_2 + \delta_3 + \cdots,$$ [18.3]


where we assume that the infinite sum is well defined. Because the $\delta_j$ must converge to zero, the
LRP can often be well approximated by a finite sum of the form $\delta_0 + \delta_1 + \cdots + \delta_p$ for sufficiently
large $p$. To interpret the LRP, suppose that the process $z_t$ is steady at $z_s = 0$ for $s < 0$. At $t = 0$, the
process permanently increases by one unit. For example, if $z_t$ is the percentage change in the money
supply and $y_t$ is the inflation rate, then we are interested in the effects of a permanent increase of one
percentage point in money supply growth. Then, by substituting $z_s = 0$ for $s < 0$ and $z_t = 1$ for $t \geq 0$,
we have

$$y_h = \alpha + \delta_0 + \delta_1 + \cdots + \delta_h + u_h,$$

where $h \geq 0$ is any horizon. Because $u_t$ has a zero mean for all $t$, we have

$$E(y_h) = \alpha + \delta_0 + \delta_1 + \cdots + \delta_h.$$ [18.4]

[It is useful to compare (18.4) and (18.2).] As the horizon increases, that is, as $h \to \infty$, the right-hand
side of (18.4) is, by definition, the long-run propensity, plus $\alpha$. Thus, the LRP measures the long-run
change in the expected value of $y$ given a one-unit, permanent increase in $z$.

GOING FURTHER 18.1
Suppose that $z_s = 0$ for $s < 0$ and that $z_0 = 1$, $z_1 = 1$, and $z_s = 0$ for $s > 1$. Find
$E(y_{-1})$, $E(y_0)$, and $E(y_h)$ for $h \geq 1$. What happens as $h \to \infty$?

The previous derivation of the LRP and the interpretation of $\delta_j$ used the fact that the errors have
a zero mean; as usual, this is not much of an assumption, provided an intercept is included in the model.
A closer examination of our reasoning shows that we assumed that the change in $z$ during any time
period had no effect on the expected value of $u_t$. This is the infinite distributed lag version of the
strict exogeneity assumption that we introduced in Chapter 10 (in particular, Assumption TS.3).
Formally,


$$E(u_t \mid \ldots, z_{t-2}, z_{t-1}, z_t, z_{t+1}, \ldots) = 0,$$ [18.5]


so that the expected value of $u_t$ does not depend on the $z$ in any time period. Although (18.5) is
natural for some applications, it rules out other important possibilities. In effect, (18.5) does not allow
feedback from $y_t$ to future $z$ because $z_{t+h}$ must be uncorrelated with $u_t$ for $h > 0$. In the inflation/
money supply growth example, where $y_t$ is inflation and $z_t$ is money supply growth, (18.5) rules out
future changes in money supply growth that are tied to changes in today's inflation rate. Given that
money supply policy often attempts to keep interest rates and inflation at certain levels, this might be
unrealistic.

One approach to estimating the $\delta_j$, which we cover in the next subsection, requires a strict
exogeneity assumption in order to produce consistent estimators of the $\delta_j$. A weaker assumption is

$$E(u_t \mid z_t, z_{t-1}, \ldots) = 0.$$ [18.6]

Under (18.6), the error is uncorrelated with current and past $z$, but it may be correlated with future $z$;
this allows $z_t$ to be a variable that follows policy rules that depend on past $y$. Sometimes, (18.6) is
sufficient to estimate the $\delta_j$; we explain this in the next subsection.

One thing to remember is that neither (18.5) nor (18.6) says anything about the serial correlation
properties of $\{u_t\}$. (This is just as in finite distributed lag models.) If anything, we might expect the
$\{u_t\}$ to be serially correlated because (18.1) is not generally dynamically complete in the sense
discussed in Section 11-4. We will study the serial correlation problem later.

How do we interpret the lag coefficients and the LRP if (18.6) holds but (18.5) does not? The
answer is: the same way as before. We can still do the previous thought (or counterfactual)
experiment even though the data we observe are generated by some feedback between $y_t$ and future $z$. For
example, we can certainly ask about the long-run effect of a permanent increase in money supply
growth on inflation even though the data on money supply growth cannot be characterized as strictly
exogenous.

18-1a The Geometric (or Koyck) Distributed Lag Model


Because there are generally an infinite number of $\delta_j$, we cannot consistently estimate them without
some restrictions. The simplest version of (18.1), which still makes the model depend on an infinite
number of lags, is the geometric (or Koyck) distributed lag. In this model, the $\delta_j$ depend on only
two parameters:

$$\delta_j = \gamma\rho^{j}, \quad |\rho| < 1, \quad j = 0, 1, 2, \ldots.$$ [18.7]

The parameters $\gamma$ and $\rho$ may be positive or negative, but $\rho$ must be less than one in absolute value.
This ensures that $\delta_j \to 0$ as $j \to \infty$. In fact, this convergence happens at a very fast rate. (For example,
with $\rho = .5$ and $j = 10$, $\rho^{j} = 1/1{,}024 < .001$.)

The impact propensity (IP) in the GDL is simply $\delta_0 = \gamma$, so the sign of the IP is
determined by the sign of $\gamma$. If $\gamma > 0$, say, and $\rho > 0$, then all lag coefficients are positive. If $\rho < 0$,
the lag coefficients alternate in sign ($\rho^{j}$ is negative for odd $j$). The long-run propensity is
more difficult to obtain, but we can use a standard result on the sum of a geometric series: for
$|\rho| < 1$, $1 + \rho + \rho^2 + \cdots + \rho^{j} + \cdots = 1/(1 - \rho)$, and so

$$LRP = \gamma/(1 - \rho).$$

The LRP has the same sign as $\gamma$.
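
A short numerical sketch may help fix ideas. It builds the geometric lag coefficients $\delta_j = \gamma\rho^{j}$, compares a truncated sum with the closed form $\gamma/(1 - \rho)$, and traces the expected response of $y$ (net of $\alpha$) to a temporary and to a permanent one-unit increase in $z$. The values $\gamma = 2$ and $\rho = .5$ and all variable names are hypothetical choices for illustration.

```python
import numpy as np

def geometric_lags(gamma, rho, n_lags):
    """Lag coefficients delta_j = gamma * rho**j for j = 0, ..., n_lags."""
    j = np.arange(n_lags + 1)
    return gamma * rho ** j

gamma, rho = 2.0, 0.5                 # hypothetical values with |rho| < 1
delta = geometric_lags(gamma, rho, n_lags=30)

lrp_exact = gamma / (1 - rho)         # closed-form long-run propensity
lrp_truncated = delta.sum()           # finite-sum approximation delta_0 + ... + delta_30

# E(y_h) - alpha after a one-unit temporary increase in z at time 0: just delta_h.
temporary_path = delta
# E(y_h) - alpha after a one-unit permanent increase starting at time 0: delta_0 + ... + delta_h.
permanent_path = np.cumsum(delta)

print(lrp_exact, lrp_truncated)
print(temporary_path[:5])
print(permanent_path[:5])
```

The temporary path dies out toward zero while the permanent path climbs toward the LRP, which mirrors the comparison of (18.2) and (18.4).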

If we plug (18.7) into (18.1), we still have a model that depends on the $z$ back to the indefinite
past. Nevertheless, a simple subtraction yields an estimable model. Write the IDL at times $t$ and
$t - 1$ as

$$y_t = \alpha + \gamma z_t + \gamma\rho z_{t-1} + \gamma\rho^2 z_{t-2} + \cdots + u_t$$ [18.8]

and

$$y_{t-1} = \alpha + \gamma z_{t-1} + \gamma\rho z_{t-2} + \gamma\rho^2 z_{t-3} + \cdots + u_{t-1}.$$ [18.9]

If we multiply the second equation by $\rho$ and subtract it from the first, all but a few of the terms cancel:

$$y_t - \rho y_{t-1} = (1 - \rho)\alpha + \gamma z_t + u_t - \rho u_{t-1},$$

which we can write as

$$y_t = \alpha_0 + \gamma z_t + \rho y_{t-1} + u_t - \rho u_{t-1},$$ [18.10]

where $\alpha_0 = (1 - \rho)\alpha$. This equation looks like a standard model with a lagged dependent variable,
where $z_t$ appears contemporaneously. Because $\gamma$ is the coefficient on $z_t$ and $\rho$ is the coefficient on $y_{t-1}$,
it appears that we can estimate these parameters. [If, for some reason, we are interested in $\alpha$, we can
always obtain $\hat{\alpha} = \hat{\alpha}_0/(1 - \hat{\rho})$ after estimating $\rho$ and $\alpha_0$.]

The simplicity of (18.10) is somewhat misleading. The error term in this equation, $u_t - \rho u_{t-1}$,
is generally correlated with $y_{t-1}$. From (18.9), it is pretty clear that $u_{t-1}$ and $y_{t-1}$ are correlated.
Therefore, if we write (18.10) as

$$y_t = \alpha_0 + \gamma z_t + \rho y_{t-1} + v_t,$$ [18.11]

where $v_t = u_t - \rho u_{t-1}$, then we generally have correlation between $v_t$ and $y_{t-1}$. Without further
assumptions, OLS estimation of (18.11) produces inconsistent estimates of $\gamma$ and $\rho$.

One case where $v_t$ must be correlated with $y_{t-1}$ occurs when $u_t$ is independent of $z_t$ and all
past values of $z$ and $y$. Then, (18.8) is dynamically complete, so $u_t$ is uncorrelated with $y_{t-1}$.
From (18.9), the covariance between $v_t$ and $y_{t-1}$ is $-\rho \mathrm{Var}(u_{t-1}) = -\rho\sigma_u^2$, which is zero only
if $\rho = 0$. We can easily see that $v_t$ is serially correlated: because $\{u_t\}$ is serially uncorrelated,
$E(v_t v_{t-1}) = E(u_t u_{t-1}) - \rho E(u_{t-1}^2) - \rho E(u_t u_{t-2}) + \rho^2 E(u_{t-1} u_{t-2}) = -\rho\sigma_u^2$. For $j > 1$, $E(v_t v_{t-j}) = 0$.
Thus, $\{v_t\}$ is a moving average process of order one (see Section 11-1). This, and equation (18.11),
gives an example of a model, derived from the original model of interest, that has a lagged
dependent variable and a particular kind of serial correlation.

If we make the strict exogeneity assumption (18.5), then $z_t$ is uncorrelated with $u_t$ and $u_{t-1}$, and
therefore with $v_t$. Thus, if we can find a suitable instrumental variable for $y_{t-1}$, then we can estimate
(18.11) by IV. What is a good IV candidate for $y_{t-1}$? By assumption, $u_t$ and $u_{t-1}$ are both uncorrelated
with $z_{t-1}$, so $v_t$ is uncorrelated with $z_{t-1}$. If $\gamma \neq 0$, $z_{t-1}$ and $y_{t-1}$ are correlated, even after partialling
out $z_t$. Therefore, we can use instruments $(z_t, z_{t-1})$ to estimate (18.11). Generally, the standard errors
need to be adjusted for serial correlation in the $\{v_t\}$, as we discussed in Section 15-7.
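
The contrast between OLS and IV in (18.11) can be seen in a small simulation. The sketch below generates data from a geometric distributed lag with serially uncorrelated, strictly exogenous errors, then estimates (18.11) by OLS and by IV using $z_{t-1}$ as the instrument for $y_{t-1}$. The parameter values, the i.i.d. normal design for $z_t$ and $u_t$, and the variable names are assumptions made for illustration; standard errors (which would need a serial correlation adjustment) are not computed.

```python
import numpy as np

rng = np.random.default_rng(42)
n, burn = 50_000, 200
alpha, gamma, rho = 1.0, 1.0, 0.5      # hypothetical parameter values

z = rng.normal(size=n + burn)
u = rng.normal(size=n + burn)          # serially uncorrelated errors in (18.8)

# Build y_t = alpha + gamma * sum_j rho^j z_{t-j} + u_t through the recursion
# w_t = rho * w_{t-1} + gamma * z_t, so that y_t = alpha + w_t + u_t.
w = np.zeros(n + burn)
for t in range(1, n + burn):
    w[t] = rho * w[t - 1] + gamma * z[t]
y = alpha + w + u

# Drop the burn-in so the truncated start of the infinite sum is negligible.
y, z = y[burn:], z[burn:]

y_t, y_lag = y[1:], y[:-1]
z_t, z_lag = z[1:], z[:-1]
const = np.ones(n - 1)

X = np.column_stack([const, z_t, y_lag])      # regressors in (18.11)
Z = np.column_stack([const, z_t, z_lag])      # instruments: z_{t-1} stands in for y_{t-1}

b_ols = np.linalg.solve(X.T @ X, X.T @ y_t)   # OLS: inconsistent for rho in this design
b_iv = np.linalg.solve(Z.T @ X, Z.T @ y_t)    # just-identified IV estimator

print("OLS (alpha0, gamma, rho):", b_ols)
print("IV  (alpha0, gamma, rho):", b_iv)
```

In this design the OLS coefficient on $y_{t-1}$ is pulled below the true $\rho$ because $\mathrm{Cov}(v_t, y_{t-1}) = -\rho\sigma_u^2 < 0$, while the IV estimate should be close to $\rho$.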

An alternative to IV estimation exploits the fact that $\{u_t\}$ may contain a specific kind of serial
correlation. In particular, in addition to (18.6), suppose that $\{u_t\}$ follows the AR(1) model

$$u_t = \rho u_{t-1} + e_t$$ [18.12]

$$E(e_t \mid z_t, y_{t-1}, z_{t-1}, \ldots) = 0.$$ [18.13]

It is important to notice that the $\rho$ appearing in (18.12) is the same parameter multiplying $y_{t-1}$ in
(18.11). If (18.12) and (18.13) hold, we can write equation (18.10) as

$$y_t = \alpha_0 + \gamma z_t + \rho y_{t-1} + e_t,$$ [18.14]


which is a dynamically complete model under (18.13). From Chapter 11, we can obtain consistent,
asymptotically normal estimators of the parameters by OLS. This is very convenient, as there is
no need to deal with serial correlation in the errors. If $e_t$ satisfies the homoskedasticity assumption
$\mathrm{Var}(e_t \mid z_t, y_{t-1}) = \sigma_e^2$, the usual inference applies. Once we have estimated $\gamma$ and $\rho$, we can easily
estimate the LRP: $\widehat{LRP} = \hat{\gamma}/(1 - \hat{\rho})$. Many econometrics packages have simple commands that allow
one to obtain a standard error for the estimated LRP.
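
One common way such a standard error is obtained is the delta method, which is sketched below as an assumption about what a package might do rather than a description of any specific command. With $\widehat{LRP} = \hat{\gamma}/(1 - \hat{\rho})$, the gradient with respect to $(\gamma, \rho)$ is $(1/(1 - \rho),\ \gamma/(1 - \rho)^2)$. The estimates and the covariance matrix in the example are hypothetical.

```python
import numpy as np

def lrp_with_se(gamma_hat, rho_hat, cov):
    """Delta-method standard error for LRP = gamma / (1 - rho).
    cov is the 2 x 2 estimated covariance matrix of (gamma_hat, rho_hat)."""
    lrp = gamma_hat / (1.0 - rho_hat)
    grad = np.array([1.0 / (1.0 - rho_hat),
                     gamma_hat / (1.0 - rho_hat) ** 2])
    se = float(np.sqrt(grad @ cov @ grad))
    return lrp, se

# Hypothetical estimates and covariance matrix, for illustration only.
cov = np.array([[0.040, -0.002],
                [-0.002, 0.010]])
lrp, se = lrp_with_se(gamma_hat=2.0, rho_hat=0.5, cov=cov)
print(lrp, se)
```

Note that the covariance between $\hat{\gamma}$ and $\hat{\rho}$ enters the calculation, so the two individual standard errors alone are not enough.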

The simplicity of this procedure relies on the potentially strong assumption that $\{u_t\}$ follows an
AR(1) process with the same $\rho$ appearing in (18.7). This is usually no worse than assuming the $\{u_t\}$
are serially uncorrelated. Nevertheless, because consistency of the estimators relies heavily on this
assumption, it is a good idea to test it. A simple test begins by specifying $\{u_t\}$ as an AR(1) process
with a different parameter, say, $u_t = \lambda u_{t-1} + e_t$. McClain and Wooldridge (1995) devised a simple
Lagrange multiplier test of $H_0: \lambda = \rho$ that can be computed after OLS estimation of (18.14).

The geometric distributed lag model extends to multiple explanatory variables, so that we have
an infinite DL in each explanatory variable, but then we must be able to write the coefficient on
$z_{t-j,h}$ as $\gamma_h\rho^{j}$. In other words, though $\gamma_h$ is different for each explanatory variable, $\rho$ is the same. Thus,
we can write

$$y_t = \alpha_0 + \gamma_1 z_{t1} + \cdots + \gamma_k z_{tk} + \rho y_{t-1} + v_t.$$ [18.15]

The same issues that arose in the case with one $z$ arise in the case with many $z$. Under the
natural extension of (18.12) and (18.13), where $z_t$ is replaced with $\mathbf{z}_t = (z_{t1}, \ldots, z_{tk})$, OLS is consistent and
asymptotically normal. Or, an IV method can be used.


18-1b Rational Distributed Lag Models 


The geometric DL implies a fairly restrictive lag distribution. When $\gamma > 0$ and $\rho > 0$, the $\delta_j$ are
positive and monotonically declining to zero. It is possible to have more general infinite distributed
lag models. The GDL is a special case of what is generally called a rational distributed lag (RDL)
model. A general treatment is beyond our scope (Harvey (1990) is a good reference), but we can
cover one simple, useful extension.

Such an RDL model is most easily described by adding a lag of $z$ to equation (18.11):

$$y_t = \alpha_0 + \gamma_0 z_t + \rho y_{t-1} + \gamma_1 z_{t-1} + v_t,$$ [18.16]


where $v_t = u_t - \rho u_{t-1}$, as before. By repeated substitution, it can be shown that (18.16) is equivalent
to the infinite distributed lag model

$$y_t = \alpha + \gamma_0 (z_t + \rho z_{t-1} + \rho^2 z_{t-2} + \cdots) + \gamma_1 (z_{t-1} + \rho z_{t-2} + \rho^2 z_{t-3} + \cdots) + u_t$$
$$\phantom{y_t} = \alpha + \gamma_0 z_t + (\rho\gamma_0 + \gamma_1) z_{t-1} + \rho(\rho\gamma_0 + \gamma_1) z_{t-2} + \rho^2(\rho\gamma_0 + \gamma_1) z_{t-3} + \cdots + u_t,$$

where we again need the assumption $|\rho| < 1$. From this last equation, we can read off the lag
distribution. In particular, the impact propensity is $\gamma_0$, while the coefficient on $z_{t-h}$ is $\rho^{h-1}(\rho\gamma_0 + \gamma_1)$ for
$h \geq 1$. Therefore, this model allows the impact propensity to differ in sign from the other lag
coefficients, even if $\rho > 0$. However, if $\rho > 0$, the $\delta_h$ have the same sign as $(\rho\gamma_0 + \gamma_1)$ for all $h \geq 1$. The
lag distribution is plotted in Figure 18.1 for $\rho = .5$, $\gamma_0 = -1$, and $\gamma_1 = 1$.

The easiest way to compute the long-run propensity is to set $y$ and $z$ at their long-run values for
all $t$, say, $y^*$ and $z^*$, and then find the change in $y^*$ with respect to $z^*$ (see also Problem 3 in Chapter 10).
We have $y^* = \alpha_0 + \gamma_0 z^* + \rho y^* + \gamma_1 z^*$, and solving gives $y^* = \alpha_0/(1 - \rho) + [(\gamma_0 + \gamma_1)/(1 - \rho)] z^*$.
Now, we use the fact that $LRP = \Delta y^*/\Delta z^*$:

$$LRP = (\gamma_0 + \gamma_1)/(1 - \rho).$$


Because $|\rho| < 1$, the LRP has the same sign as $\gamma_0 + \gamma_1$, and the LRP is zero if, and only if,
$\gamma_0 + \gamma_1 = 0$, as in Figure 18.1.
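
The lag distribution and LRP of this simple rational distributed lag model are easy to tabulate. The sketch below uses the values $\rho = .5$, $\gamma_0 = -1$, and $\gamma_1 = 1$ mentioned for Figure 18.1; the function name and the truncation at 40 lags are arbitrary choices for illustration.

```python
import numpy as np

def rdl_lag_distribution(gamma0, gamma1, rho, n_lags):
    """Lag coefficients of the rational DL model (18.16):
    delta_0 = gamma0 and delta_h = rho**(h-1) * (rho*gamma0 + gamma1) for h >= 1."""
    delta = np.empty(n_lags + 1)
    delta[0] = gamma0
    h = np.arange(1, n_lags + 1)
    delta[1:] = rho ** (h - 1) * (rho * gamma0 + gamma1)
    return delta

gamma0, gamma1, rho = -1.0, 1.0, 0.5
delta = rdl_lag_distribution(gamma0, gamma1, rho, n_lags=40)

lrp_formula = (gamma0 + gamma1) / (1 - rho)   # closed form for the LRP
lrp_sum = delta.sum()                          # truncated sum of the lag coefficients

print(delta[:5])
print(lrp_formula, lrp_sum)
```

The impact propensity is negative while every later lag coefficient is positive, and the positive lags exactly offset the initial effect, so the LRP is zero.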


EXAMPLE 18.1 Housing Investment and Residential Price Inflation 


We estimate both the basic geometric and the rational distributed lag models by applying OLS to
(18.14) and (18.16), respectively. The dependent variable is log(invpc) after a linear time trend has
been removed [that is, we linearly detrend log(invpc)]. For $z_t$, we use the growth in the price index.
This allows us to estimate how residential price inflation affects movements in housing investment
around its trend. The results of the estimation, using the data in HSEINV, are given in Table 18.1.

TABLE 18.1 Distributed Lag Models for Housing Investment

Dependent Variable: log(invpc), detrended

Independent Variables     Geometric DL     Rational DL
gprice                    3.095            3.256
                          (.933)           (.970)
y_{-1}                    .340             .547
                          (.132)           (.152)
gprice_{-1}               --               -2.936
                                           (.973)
constant                  -.010            .006
                          (.018)           (.017)
Long-run propensity       4.689            .706
Sample size               41               40
Adjusted R-squared        .375             .504

The geometric distributed lag model is clearly rejected by the data, as $gprice_{-1}$ is very
significant. The adjusted R-squareds also show that the RDL model fits much better.

The two models give very different estimates of the long-run propensity. If we incorrectly use
the GDL, the estimated LRP is almost five: a permanent one percentage point increase in residential
price inflation increases long-term housing investment by 4.7% (above its trend value). Economically,
this seems implausible. The LRP estimated from the rational distributed lag model is below one.
In fact, we cannot reject the null hypothesis $H_0: \gamma_0 + \gamma_1 = 0$ at any reasonable significance level
(p-value = .83), so there is no evidence that the LRP is different from zero. This is a good
example of how misspecifying the dynamics of a model by omitting relevant lags can lead to erroneous
conclusions.


18-2 Testing for Unit Roots 


We now turn to the important problem of testing whether a time series has a unit root. In
Chapter 11, we gave some vague, necessarily informal guidelines to decide whether a series is I(1) or
not. In many cases, it is useful to have a formal test for a unit root. As we will see, such tests must be
applied with caution.

The simplest approach to testing for a unit root begins with an AR(1) model: 


$$y_t = \alpha + \rho y_{t-1} + e_t, \quad t = 1, 2, \ldots,$$ [18.17]

where $y_0$ is the observed initial value. Throughout this section, we let $\{e_t\}$ denote a process that has
zero mean, given past observed $y$:

$$E(e_t \mid y_{t-1}, y_{t-2}, \ldots, y_0) = 0.$$ [18.18]

[Under (18.18), $\{e_t\}$ is said to be a martingale difference sequence with respect to $\{y_{t-1}, y_{t-2}, \ldots\}$.
If $\{e_t\}$ is assumed to be i.i.d. with zero mean and is independent of $y_0$, then it also satisfies (18.18).]

If $\{y_t\}$ follows (18.17), it has a unit root if, and only if, $\rho = 1$. If $\alpha = 0$ and $\rho = 1$, $\{y_t\}$ follows
a random walk without drift [with the innovations $e_t$ satisfying (18.18)]. If $\alpha \neq 0$ and $\rho = 1$, $\{y_t\}$ is a
random walk with drift, which means that $E(y_t)$ is a linear function of $t$. A unit root process with drift
behaves very differently from one without drift. Nevertheless, it is common to leave $\alpha$ unspecified
under the null hypothesis, and this is the approach we take. Therefore, the null hypothesis is that $\{y_t\}$
has a unit root:

$$H_0: \rho = 1.$$ [18.19]

In almost all cases, we are interested in the one-sided alternative

$$H_1: \rho < 1.$$ [18.20]

(In practice, this means $0 < \rho < 1$, as $\rho < 0$ for a series that we suspect has a unit root would be very
rare.) The alternative $H_1: \rho > 1$ is not usually considered, because it implies that $y_t$ is explosive. In
fact, if $\alpha > 0$, $y_t$ has an exponential trend in its mean when $\rho > 1$.

When $|\rho| < 1$, $\{y_t\}$ is a stable AR(1) process, which means it is weakly dependent or
asymptotically uncorrelated. Recall from Chapter 11 that $\mathrm{Corr}(y_t, y_{t+h}) = \rho^{h} \to 0$ when $|\rho| < 1$. Therefore,
testing (18.19) in model (18.17), with the alternative given by (18.20), is really a test of whether $\{y_t\}$
is I(1) against the alternative that $\{y_t\}$ is I(0). [We do not take the null to be I(0) in this setup because
$\{y_t\}$ is I(0) for any value of $\rho$ strictly between $-1$ and 1, something that classical hypothesis
testing does not handle easily. There are tests where the null hypothesis is I(0) against the alternative
of I(1), but these take a different approach. See, for example, Kwiatkowski, Phillips, Schmidt, and
Shin (1992).]

A convenient equation for carrying out the unit root test is obtained by subtracting $y_{t-1}$ from both
sides of (18.17) and defining $\theta = \rho - 1$:

$$\Delta y_t = \alpha + \theta y_{t-1} + e_t.$$ [18.21]

Under (18.18), this is a dynamically complete model, and so it seems straightforward to test $H_0: \theta = 0$
against $H_1: \theta < 0$. The problem is that, under $H_0$, $y_{t-1}$ is I(1), and so the usual central limit theorem
that underlies the asymptotic standard normal distribution for the t statistic does not apply: the t
statistic does not have an approximate standard normal distribution even in large sample sizes. The
asymptotic distribution of the t statistic under $H_0$ has come to be known as the Dickey-Fuller distribution
after Dickey and Fuller (1979).

Although we cannot use the usual critical values, we can use the usual t statistic for $\hat{\theta}$ in (18.21),
at least once the appropriate critical values have been tabulated. The resulting test is known as the
Dickey-Fuller (DF) test for a unit root. The theory used to obtain the asymptotic critical values is
rather complicated and is covered in advanced texts on time series econometrics. [See, for example,
Banerjee, Dolado, Galbraith, and Hendry (1993), or BDGH for short.] By contrast, using these results
is very easy. The critical values for the t statistic have been tabulated by several authors, beginning
with the original work by Dickey and Fuller (1979). Table 18.2 contains the large sample critical
values for various significance levels, taken from BDGH (1993, Table 4.2). (Critical values adjusted
for small sample sizes are available in BDGH.)

We reject the null hypothesis $H_0: \theta = 0$ against $H_1: \theta < 0$ if $t_{\hat{\theta}} < c$, where $c$ is one of the negative
values in Table 18.2. For example, to carry out the test at the 5% significance level, we reject if
$t_{\hat{\theta}} < -2.86$. This requires a t statistic with a much larger magnitude than if we used the standard
normal critical value, which would be $-1.65$. If we use the standard normal critical value to test for a unit
root, we would reject $H_0$ much more often than 5% of the time when $H_0$ is true.
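
A small Monte Carlo experiment illustrates why the standard normal critical value is inappropriate. The sketch below simulates random walks without drift, runs the regression (18.21) on each, and records how often the t statistic on $y_{t-1}$ falls below $-1.65$ and below $-2.86$. The sample size, number of replications, seed, and function name are arbitrary assumptions, and the results are simulation approximations rather than exact rejection rates.

```python
import numpy as np

def df_t_statistics(n_obs, n_reps, rng):
    """t statistics on y_{t-1} from regressing Delta y_t on a constant and y_{t-1},
    computed for n_reps simulated random walks without drift (y_0 = 0)."""
    e = rng.normal(size=(n_reps, n_obs))
    y = np.cumsum(e, axis=1)                                   # y_1, ..., y_n
    y_lag = np.column_stack([np.zeros(n_reps), y[:, :-1]])     # y_0, ..., y_{n-1}
    dy = e                                                     # Delta y_t = e_t under the null

    # Simple regression with an intercept, done row by row via demeaning.
    x = y_lag - y_lag.mean(axis=1, keepdims=True)
    d = dy - dy.mean(axis=1, keepdims=True)
    sxx = np.sum(x * x, axis=1)
    theta_hat = np.sum(x * d, axis=1) / sxx
    resid = d - theta_hat[:, None] * x
    s2 = np.sum(resid ** 2, axis=1) / (n_obs - 2)
    return theta_hat / np.sqrt(s2 / sxx)

rng = np.random.default_rng(123)
t_stats = df_t_statistics(n_obs=250, n_reps=5000, rng=rng)

reject_normal = np.mean(t_stats < -1.65)   # using the standard normal 5% critical value
reject_df = np.mean(t_stats < -2.86)       # using the Dickey-Fuller 5% critical value

print("rejection rate with -1.65:", reject_normal)
print("rejection rate with -2.86:", reject_df)
```

With the Dickey-Fuller critical value the simulated rejection rate should be near the nominal 5%, while the standard normal critical value rejects a true unit root far more often.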


TABLE 18.2 Asymptotic Critical Values for Unit Root t Test: No Time Trend

Significance level     1%        2.5%      5%        10%
Critical value         -3.43     -3.12     -2.86     -2.57

EXAMPLE 18.2 Unit Root Test for Three-Month T-Bill Rates


We use the quarterly data in INTQRT to test for a unit root in three-month T-bill rates. When we
estimate (18.21), we obtain

$$\widehat{\Delta r3}_t = \underset{(.261)}{.625} - \underset{(.037)}{.091}\, r3_{t-1}, \quad n = 123, \ R^2 = .048,$$ [18.22]

where we keep with our convention of reporting standard errors in parentheses below the estimates.
We must remember that these standard errors cannot be used to construct usual confidence
intervals or to carry out traditional t tests because these do not behave in the usual ways when there is
a unit root. The coefficient on $r3_{t-1}$ shows that the estimate of $\rho$ is $\hat{\rho} = 1 + \hat{\theta} = .909$. While this
is less than unity, we do not know whether it is statistically less than one. The t statistic on $r3_{t-1}$ is
$-.091/.037 = -2.46$. From Table 18.2, the 10% critical value is $-2.57$; therefore, we fail to reject
$H_0: \rho = 1$ against $H_1: \rho < 1$ at the 10% significance level.


As with other hypothesis tests, when we fail to reject $H_0$, we do not say that we accept $H_0$. Why?
Suppose we test $H_0: \rho = .9$ in the previous example using a standard t test, which is asymptotically
valid because $y_t$ is I(0) under $H_0$. Then, we obtain $t = (.909 - .9)/.037 = .009/.037 \approx .24$, which is very small
and provides no evidence against $\rho = .9$. Yet, it makes no sense to accept both $\rho = 1$ and $\rho = .9$.

When we fail to reject a unit root, as in the previous example, we should only conclude that the
data do not provide strong evidence against $H_0$. In this example, the test does provide some evidence
against $H_0$, because the t statistic is close to the 10% critical value. (Ideally, we would compute a
p-value, but this requires special software because of the nonnormal distribution.) In addition, though
$\hat{\rho} = .91$ implies a fair amount of persistence in $\{r3_t\}$, the correlation between observations that are
10 periods apart for an AR(1) model with $\rho = .9$ is about .35, rather than almost one if $\rho = 1$.

What happens if we now want to use $r3_t$ as an explanatory variable in a regression analysis? The
outcome of the unit root test implies that we should be extremely cautious: if $r3_t$ does have a unit root, the
usual asymptotic approximations need not hold (as we discussed in Chapter 11). One solution is to use
the first difference of $r3_t$ in any analysis. As we will see in Section 18-4, that is not the only possibility.

We also need to test for unit roots in models with more complicated dynamics. If $\{y_t\}$ follows
(18.17) with $\rho = 1$, then $\Delta y_t$ is serially uncorrelated. We can easily allow $\{\Delta y_t\}$ to follow an AR
model by augmenting equation (18.21) with additional lags. For example,

$$\Delta y_t = \alpha + \theta y_{t-1} + \gamma_1 \Delta y_{t-1} + e_t,$$ [18.23]

where $|\gamma_1| < 1$. This ensures that, under $H_0: \theta = 0$, $\{\Delta y_t\}$ follows a stable AR(1) model. Under the
alternative $H_1: \theta < 0$, it can be shown that $\{y_t\}$ follows a stable AR(2) model.

More generally, we can add $p$ lags of $\Delta y_t$ to the equation to account for the dynamics in the
process. The way we test the null hypothesis of a unit root is very similar: we run the regression of


$\Delta y_t$ on $y_{t-1}, \Delta y_{t-1}, \ldots, \Delta y_{t-p}$ [18.24]


and carry out the $t$ test on $\hat{\theta}$, the coefficient on $y_{t-1}$, just as before. This extended version of the
Dickey-Fuller test is usually called the augmented Dickey-Fuller test because the regression has
been augmented with the lagged changes, $\Delta y_{t-h}$. The critical values and rejection rule are the same
as before. The inclusion of the lagged changes in (18.24) is intended to clean up any serial correlation
in $\Delta y_t$. The more lags we include in (18.24), the more initial observations we lose. If we include
too many lags, the small sample power of the test generally suffers. But if we include too few lags,
the size of the test will be incorrect, even asymptotically, because the validity of the critical values in
Table 18.2 relies on the dynamics being completely modeled. Often, the lag length is dictated by the
frequency of the data (as well as the sample size). For annual data, one or two lags usually suffice. For
monthly data, we might include 12 lags. But there are no hard rules to follow in any case.
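
The regression in (18.24) is an ordinary OLS regression, so the test statistic is easy to compute by hand. The following is a minimal sketch in Python using only NumPy; the function name `adf_t_stat` and the simulated series are illustrative assumptions, not part of the text. It returns $\hat{\theta}$ and its $t$ statistic, which must then be compared with the Dickey-Fuller critical values rather than the usual $t$ table.

```python
import numpy as np

def adf_t_stat(y, lags=1):
    """Augmented Dickey-Fuller regression with an intercept.

    Regresses dy_t on a constant, y_{t-1}, and `lags` lagged changes,
    and returns (theta_hat, t_stat) for the coefficient on y_{t-1}.
    """
    y = np.asarray(y, dtype=float)
    dy = np.diff(y)                            # dy[i] = y[i+1] - y[i]
    dep = dy[lags:]                            # the first `lags` changes are lost
    cols = [np.ones_like(dep), y[lags:-1]]     # intercept and lagged level
    for j in range(1, lags + 1):
        cols.append(dy[lags - j:len(dy) - j])  # j-th lagged change
    X = np.column_stack(cols)
    coef = np.linalg.lstsq(X, dep, rcond=None)[0]
    resid = dep - X @ coef
    sigma2 = resid @ resid / (len(dep) - X.shape[1])
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
    return coef[1], coef[1] / se[1]

# Simulated data (an assumption for illustration only).
rng = np.random.default_rng(0)
e = rng.standard_normal(500)

# Stable AR(1) with rho = .5: the unit root null is false.
stable = np.zeros(500)
for t in range(1, 500):
    stable[t] = 0.5 * stable[t - 1] + e[t]

# Random walk (rho = 1): the unit root null is true.
walk = np.cumsum(e)

theta_stable, t_stable = adf_t_stat(stable, lags=1)
theta_walk, t_walk = adf_t_stat(walk, lags=1)
print(f"stable AR(1): theta_hat = {theta_stable:.3f}, t = {t_stable:.2f}")
print(f"random walk:  theta_hat = {theta_walk:.3f}, t = {t_walk:.2f}")
```

Each added lag removes one more initial observation from `dep`, which is the cost of extra lags described above.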

Interestingly, the $t$ statistics on the lagged changes have approximate $t$ distributions. The $F$ statistics
for joint significance of any group of terms $\Delta y_{t-h}$ are also asymptotically valid. (These maintain
the homoskedasticity assumption discussed in Section 11-5.) Therefore, we can use standard tests to
determine whether we have enough lagged changes in (18.24).


EXAMPLE 18.3 Unit Root Test for Annual U.S. Inflation 


We use annual data on U.S. inflation, based on the CPI, to test for a unit root in inflation (see
PHILLIPS), restricting ourselves to the years from 1948 through 2010. Allowing for one lag of $\Delta inf_t$
in the augmented Dickey-Fuller regression gives


$\widehat{\Delta inf}_t = 1.21 - .300\, inf_{t-1} + .102\, \Delta inf_{t-1}$
(0.43) (.093) (.115)
$n = 61$, $R^2 = .155$.


The $t$ statistic for the unit root test is $-.300/.093 \approx -3.23$. Because the 2.5% critical value is $-3.12$,
and $t < -3.12$, we reject the unit root hypothesis at the 2.5% significance level. The estimate of $\rho$
is $\hat{\rho} = .700$, which is not particularly close to one. Therefore, along with the statistical rejection, we
have pretty strong evidence against a unit root. The lag $\Delta inf_{t-1}$ has a $t$ statistic of about .89, so we
did not need to include it, but we could not know that ahead of time. If we drop $\Delta inf_{t-1}$, the evidence
against a unit root is slightly stronger: $\hat{\theta} = -.334$ ($\hat{\rho} = .666$) and $t_{\hat{\theta}} \approx -3.55$, which leads to a
rejection of a unit root at the 1% significance level.
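
The decisions in this example can be reproduced with a few lines of arithmetic. The sketch below hard-codes the asymptotic Dickey-Fuller critical values for the case with an intercept and no trend; the 2.5% and 5% values are quoted in the text, while the 1% and 10% values are the standard asymptotic ones and should be checked against Table 18.2. The helper name `strictest_rejection` is hypothetical.

```python
# Asymptotic Dickey-Fuller critical values, intercept but no time trend.
# The 1% and 10% entries are assumed here; verify them against Table 18.2.
DF_CRIT_NO_TREND = {"1%": -3.43, "2.5%": -3.12, "5%": -2.86, "10%": -2.57}

def strictest_rejection(t_stat, crit):
    """Smallest tabulated level at which the unit root null is rejected.

    The test is one-sided: reject when t_stat is below the critical value.
    Returns None when the null is not rejected even at the 10% level.
    """
    for level in ("1%", "2.5%", "5%", "10%"):
        if t_stat < crit[level]:
            return level
    return None

t_with_lag = -0.300 / 0.093          # about -3.23
rho_hat = 1 - 0.300                  # .700
level_with_lag = strictest_rejection(t_with_lag, DF_CRIT_NO_TREND)
level_without_lag = strictest_rejection(-3.55, DF_CRIT_NO_TREND)

print(round(t_with_lag, 2), level_with_lag)   # -3.23 2.5%
print(level_without_lag)                      # 1%
```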


For series that have clear time trends, we need to modify the test for unit roots. A trend-stationary 
process—which has a linear trend in its mean but is I(0) about its trend—can be mistaken for a unit 
root process if we do not control for a time trend in the Dickey-Fuller regression. In other words, if we 
carry out the usual DF or augmented DF test on a trending but I(0) series, we will probably have little 
power for rejecting a unit root. 

To allow for series with time trends, we change the basic equation to 


$\Delta y_t = \alpha + \delta t + \theta y_{t-1} + e_t$, [18.25]


where again the null hypothesis is $H_0$: $\theta = 0$, and the alternative is $H_1$: $\theta < 0$. Under the alternative,
$\{y_t\}$ is a trend-stationary process. If $y_t$ has a unit root, then $\Delta y_t = \alpha + \delta t + e_t$, and so the change in
$y_t$ has a mean linear in $t$ unless $\delta = 0$. [It can be shown that $E(y_t)$ is actually a quadratic in $t$.] It is
unusual for the first difference of an economic series to have a linear trend, so a more appropriate null
hypothesis is probably $H_0$: $\theta = 0$, $\delta = 0$. Although it is possible to test this joint hypothesis using
an $F$ test (with modified critical values), it is common to test $H_0$: $\theta = 0$ using only a $t$ test. We
follow that approach here. [See BDGH (1993, Section 4-4) for more details on the joint test.]

When we include a time trend in the regression, the critical values of the test change. Intuitively,
this occurs because detrending a unit root process tends to make it look more like an I(0) process.
Therefore, we require a larger magnitude for the $t$ statistic in order to reject $H_0$. The Dickey-Fuller
critical values for the $t$ test that includes a time trend are given in Table 18.3; they are taken from
BDGH (1993, Table 4.2).


TABLE 18.3 Asymptotic Critical Values for Unit Root t Test: Linear Time Trend
Significance level    1%      2.5%    5%      10%
Critical value        -3.96   -3.66   -3.41   -3.12


For example, to reject a unit root at the 5% level, we need the $t$ statistic on $\hat{\theta}$ to be less than
$-3.41$, as compared with $-2.86$ without a time trend.

We can augment equation (18.25) with lags of $\Delta y_t$ to account for serial correlation, just as in the
case without a trend.
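
To see why the trend matters, the sketch below simulates a trend-stationary series and runs the Dickey-Fuller regression with and without a time trend. It is only an illustration under assumed parameter values (a trend slope of .5 and AR(1) coefficient of .5); the function name `df_t_stat` is hypothetical.

```python
import numpy as np

def df_t_stat(y, trend=False):
    """Dickey-Fuller t statistic on y_{t-1}, optionally including a linear trend."""
    y = np.asarray(y, dtype=float)
    dy = np.diff(y)
    cols = [np.ones_like(dy)]
    if trend:
        cols.append(np.arange(1, len(dy) + 1, dtype=float))  # linear time trend
    cols.append(y[:-1])                                      # lagged level goes last
    X = np.column_stack(cols)
    coef = np.linalg.lstsq(X, dy, rcond=None)[0]
    resid = dy - X @ coef
    sigma2 = resid @ resid / (len(dy) - X.shape[1])
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
    return coef[-1] / se[-1]

# Trend-stationary series: linear trend plus a stable AR(1) deviation.
rng = np.random.default_rng(1)
n = 200
shocks = rng.standard_normal(n)
u = np.zeros(n)
for t in range(1, n):
    u[t] = 0.5 * u[t - 1] + shocks[t]
y = 0.5 * np.arange(n) + u

t_no_trend = df_t_stat(y, trend=False)  # compare with -2.86 (5%, no trend)
t_trend = df_t_stat(y, trend=True)      # compare with -3.41 (5%, Table 18.3)
print(f"without trend: t = {t_no_trend:.2f}")
print(f"with trend:    t = {t_trend:.2f}")
```

Without the trend, the regression attributes the upward drift to persistence in the level, so the test has little power even though the deviations from trend are clearly I(0).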


EXAMPLE 18.4 Unit Root in the Log of U.S. Real Gross Domestic Product 


We can apply the unit root test with a time trend to the U.S. GDP data in INVEN. These annual data
cover the years from 1959 through 1995. We test whether $\log(GDP_t)$ has a unit root. This series has a
pronounced trend that looks roughly linear. We include a single lag of $\Delta\log(GDP_t)$, which is simply
the growth in GDP (in decimal form), to account for dynamics:


$\widehat{gGDP}_t = 1.65 + .0059\, t - .210\, \log(GDP_{t-1}) + .264\, gGDP_{t-1}$
(.67) (.0027) (.087) (.165) [18.26]
$n = 35$, $R^2 = .268$.


From this equation, we get $\hat{\rho} = 1 - .21 = .79$, which is clearly less than one. But we cannot reject
a unit root in the log of GDP: the $t$ statistic on $\log(GDP_{t-1})$ is $-.210/.087 = -2.41$, which is well
above the 10% critical value of $-3.12$. The $t$ statistic on $gGDP_{t-1}$ is 1.60, which is almost significant
at the 10% level against a two-sided alternative.

What should we conclude about a unit root? Again, we cannot reject a unit root, but the point 
estimate of p is not especially close to one. When we have a small sample size—and n = 35 is con- 
sidered to be pretty small—it is very difficult to reject the null hypothesis of a unit root if the process 
has something close to a unit root. Using more data over longer time periods, many researchers have 
concluded that there is little evidence against the unit root hypothesis for log(GDP). This has led most 
of them to assume that the growth in GDP is I(0), which means that log(GDP) is I(1). Unfortunately, 
given currently available sample sizes, we cannot have much confidence in this conclusion. 

If we omit the time trend, there is much less evidence against $H_0$, as $\hat{\theta} = -.023$ and $t_{\hat{\theta}} = -1.92$.
Here, the estimate of $\rho$ is much closer to one, but this is misleading due to the omitted time trend.


It is tempting to compare the $t$ statistic on the time trend in (18.26) with the critical value from
a standard normal or $t$ distribution, to see whether the time trend is significant. Unfortunately, the $t$
statistic on the trend does not have an asymptotic standard normal distribution (unless $|\rho| < 1$). The
asymptotic distribution of this $t$ statistic is known, but it is rarely used. Typically, we rely on intuition
(or plots of the time series) to decide whether to include a trend in the DF test.

There are many other variants on unit root tests. In one version that is applicable only to series
that are clearly not trending, the intercept is omitted from the regression; that is, $\alpha$ is set to zero in
(18.21). This variant of the Dickey-Fuller test is rarely used because of biases induced if $\alpha \neq 0$. Also,
we can allow for more complicated time trends, such as quadratic. Again, this is seldom used.

Another class of tests attempts to account for serial correlation in $\Delta y_t$ in a different manner than
by including lags in (18.21) or (18.25). The approach is related to the serial correlation-robust standard
errors for the OLS estimators that we discussed in Section 12-5. The idea is to be as agnostic as
possible about serial correlation in $\Delta y_t$. In practice, the (augmented) Dickey-Fuller test has held up
pretty well. [See BDGH (1993, Section 4-3) for a discussion on other tests.]


18-3 Spurious Regression 


In a cross-sectional environment, we use the phrase "spurious correlation" to describe a situation
where two variables are related through their correlation with a third variable. In particular, if we
regress $y$ on $x$, we find a significant relationship. But when we control for another variable, say, $z$, the
partial effect of $x$ on $y$ becomes zero. Naturally, this can also happen in time series contexts with I(0)
variables.

As we discussed in Section 10-5, it is possible to find a spurious relationship between time series 
that have increasing or decreasing trends. Provided the series are weakly dependent about their time 
trends, the problem is effectively solved by including a time trend in the regression model. 

When we are dealing with integrated processes of order one, there is an additional complication.
Even if the two series have means that are not trending, a simple regression involving two independent
I(1) series will often result in a significant $t$ statistic.

To be more precise, let $\{x_t\}$ and $\{y_t\}$ be random walks generated by


$x_t = x_{t-1} + a_t$, $t = 1, 2, \ldots$, [18.27]
and
$y_t = y_{t-1} + e_t$, $t = 1, 2, \ldots$, [18.28]


where $\{a_t\}$ and $\{e_t\}$ are independent, identically distributed innovations, with mean zero and variances
$\sigma_a^2$ and $\sigma_e^2$, respectively. For concreteness, take the initial values to be $x_0 = y_0 = 0$. Assume further
that $\{a_t\}$ and $\{e_t\}$ are independent processes. This implies that $\{x_t\}$ and $\{y_t\}$ are also independent.
But what if we run the simple regression


$\hat{y}_t = \hat{\beta}_0 + \hat{\beta}_1 x_t$ [18.29]


and obtain the usual $t$ statistic for $\hat{\beta}_1$ and the usual R-squared? Because $y_t$ and $x_t$ are independent,
we would hope that plim $\hat{\beta}_1 = 0$. Even more importantly, if we test $H_0$: $\beta_1 = 0$ against $H_1$: $\beta_1 \neq 0$
at the 5% level, we hope that the $t$ statistic for $\hat{\beta}_1$ is insignificant 95% of the time. Through a simulation,
Granger and Newbold (1974) showed that this is not the case: even though $y_t$ and $x_t$ are independent,
the regression of $y_t$ on $x_t$ yields a statistically significant $t$ statistic a large percentage of the
time, much larger than the nominal significance level. Granger and Newbold called this the spurious
regression problem: there is no sense in which $y$ and $x$ are related, but an OLS regression using the
usual $t$ statistics will often indicate a relationship.

More recent simulation results are given by Davidson and MacKinnon (1993, Table 19.1), where
$a_t$ and $e_t$ are generated as independent, identically distributed normal random variables, and 10,000
different samples are generated. For a sample size of $n = 50$ at the 5% significance level, the standard $t$
statistic for $H_0$: $\beta_1 = 0$ against the two-sided alternative rejects $H_0$ about 66.2% of the time under $H_0$,
rather than 5% of the time. As the sample size increases, things get worse: with $n = 250$, the null is
rejected 84.7% of the time!
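
A simulation of this kind is short enough to run oneself. The following sketch is not the Davidson and MacKinnon design itself: it uses 1,000 replications rather than 10,000, a fixed seed, and the normal critical value 1.96 in place of the exact $t$ critical value, all of which are assumptions made for brevity. The rejection frequencies it produces should be in the neighborhood of the figures quoted above, but they will not match exactly.

```python
import numpy as np

def slope_t_stat(y, x):
    """Usual OLS t statistic on x in the regression of y on a constant and x."""
    X = np.column_stack([np.ones_like(x), x])
    coef = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ coef
    sigma2 = resid @ resid / (len(y) - 2)
    se_slope = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[1, 1])
    return coef[1] / se_slope

def rejection_rate(n, reps, rng, crit=1.96):
    """Share of replications in which H0: beta_1 = 0 is rejected (two-sided)
    when y and x are independent random walks of length n."""
    rejections = 0
    for _ in range(reps):
        x = np.cumsum(rng.standard_normal(n))  # x_t = x_{t-1} + a_t, x_0 = 0
        y = np.cumsum(rng.standard_normal(n))  # y_t = y_{t-1} + e_t, y_0 = 0
        if abs(slope_t_stat(y, x)) > crit:
            rejections += 1
    return rejections / reps

rng = np.random.default_rng(42)
rate_50 = rejection_rate(50, 1000, rng)
rate_250 = rejection_rate(250, 1000, rng)
print(f"n = 50:  rejection rate = {rate_50:.3f}")
print(f"n = 250: rejection rate = {rate_250:.3f}")
```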

GOING FURTHER 18.2


Under the preceding setup, where $\{x_t\}$ and $\{y_t\}$ are generated by (18.27) and (18.28)
and $\{e_t\}$ and $\{a_t\}$ are i.i.d. sequences, what is the plim of the slope coefficient, say, $\hat{\gamma}_1$,
from the regression of $\Delta y_t$ on $\Delta x_t$? Describe the behavior of the $t$ statistic of $\hat{\gamma}_1$.


Here is one way to see what is happening when we regress the level of $y$ on the level of $x$. Write
the model underlying (18.29) as


$y_t = \beta_0 + \beta_1 x_t + u_t$. [18.30]


For the $t$ statistic of $\hat{\beta}_1$ to have an approximate standard normal distribution in large samples, at a
minimum, $\{u_t\}$ should be a mean zero, serially uncorrelated process. But under $H_0$: $\beta_1 = 0$, $y_t = \beta_0 + u_t$,
and, because $\{y_t\}$ is a random walk starting at $y_0 = 0$, equation (18.30) holds under $H_0$ only if $\beta_0 = 0$
and, more importantly, if $u_t = y_t = \sum_{j=1}^{t} e_j$. In other words, $\{u_t\}$ is a random walk under $H_0$. This
clearly violates even the asymptotic version of the Gauss-Markov assumptions from Chapter 11.

Including a time trend does not really change the conclusion. If $y_t$ or $x_t$ is a random walk with
drift and a time trend is not included, the spurious regression problem is even worse. The same qualitative
conclusions hold if $\{a_t\}$ and $\{e_t\}$ are general I(0) processes, rather than i.i.d. sequences.

In addition to the usual $t$ statistic not having a limiting standard normal distribution (in fact, it
increases to infinity as $n \to \infty$), the behavior of R-squared is nonstandard. In cross-sectional contexts
or in regressions with I(0) time series variables, the R-squared converges in probability to the
population R-squared: $1 - \sigma_u^2/\sigma_y^2$. This is not the case in spurious regressions with I(1) processes.
Rather than the R-squared having a well-defined plim, it actually converges to a random variable.
Formalizing this notion is well beyond the scope of this text. [A discussion of the asymptotic properties
of the $t$ statistic and the R-squared can be found in BDGH (Section 3-1).] The implication is that
the R-squared is large with high probability, even though $\{y_t\}$ and $\{x_t\}$ are independent time series
processes.
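
The contrast between I(0) and I(1) regressions shows up clearly in a small simulation. The sketch below, with an assumed sample size of 100 and 1,000 replications, compares the average R-squared from regressing one independent random walk on another with the average R-squared from regressing one independent i.i.d. series on another. The specific numbers depend on the seed and design and are only illustrative.

```python
import numpy as np

def r_squared(y, x):
    """R-squared from the OLS regression of y on a constant and x."""
    X = np.column_stack([np.ones_like(x), x])
    coef = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ coef
    return 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)

rng = np.random.default_rng(7)
n, reps = 100, 1000
r2_i1, r2_i0 = [], []
for _ in range(reps):
    a = rng.standard_normal(n)          # i.i.d. innovations, an I(0) series
    e = rng.standard_normal(n)          # independent of a
    x, y = np.cumsum(a), np.cumsum(e)   # independent random walks, I(1)
    r2_i1.append(r_squared(y, x))       # spurious regression in levels
    r2_i0.append(r_squared(e, a))       # regression with I(0) variables

mean_r2_i1 = float(np.mean(r2_i1))
mean_r2_i0 = float(np.mean(r2_i0))
print(f"independent I(1) series: mean R-squared = {mean_r2_i1:.3f}")
print(f"independent I(0) series: mean R-squared = {mean_r2_i0:.3f}")
```

With I(0) series the population R-squared is zero and the sample values cluster near it; with I(1) series the R-squared is spread over a wide range and is frequently large.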

The same considerations arise with multiple independent variables, each of which may be I(1) or
some of which may be I(0). If $\{y_t\}$ is I(1) and at least some of the explanatory variables are I(1), the
regression results may be spurious.

The possibility of spurious regression with I(1) variables is quite important and has led economists
to reexamine many aggregate time series regressions whose $t$ statistics were very significant and
whose R-squareds were extremely high. In the next section, we show that regressing an I(1) dependent
variable on an I(1) independent variable can be informative, but only if these variables are related
in a precise sense.


18-4 Cointegration and Error Correction Models 


The discussion of spurious regression in the previous section certainly makes one wary of using the 
levels of I(1) variables in regression analysis. In earlier chapters, we suggested that I(1) variables 
should be differenced before they are used in linear regression models, whether they are estimated by 
OLS or instrumental variables. This is certainly a safe course to follow, and it is the approach used in 
many time series regressions after Granger and Newbold’s original paper on the spurious regression 
problem. Unfortunately, always differencing I(1) variables limits the scope of the questions that we 
can answer. 


18-4a Cointegration 


The notion of cointegration, which was given a formal treatment in Engle and Granger (1987), 
makes regressions involving I(1) variables potentially meaningful. A full treatment of cointegration 
is mathematically involved, but we can describe the basic issues and methods that are used in many 
applications. 

If $\{y_t: t = 0, 1, \ldots\}$ and $\{x_t: t = 0, 1, \ldots\}$ are two I(1) processes, then, in general, $y_t - \beta x_t$ is
an I(1) process for any number $\beta$. Nevertheless, it is possible that for some $\beta \neq 0$, $y_t - \beta x_t$ is an
I(0) process, which means it has constant mean, constant variance, and autocorrelations that depend
only on the time distance between any two variables in the series, and it is asymptotically uncorrelated.
If such a $\beta$ exists, we say that $y$ and $x$ are cointegrated, and we call $\beta$ the cointegration parameter.
[Alternatively, we could look at $x_t - \gamma y_t$ for $\gamma \neq 0$: if $y_t - \beta x_t$ is I(0), then $x_t - (1/\beta) y_t$ is I(0). Therefore,
the linear combination of $y_t$ and $x_t$ is not unique, but if we fix the coefficient on $y_t$ at unity, then $\beta$ is unique.
See Problem 3 at end of chapter. For concreteness, we consider linear combinations of the form $y_t - \beta x_t$.]


GOING FURTHER 18.3


Let $\{(y_t, x_t): t = 1, 2, \ldots\}$ be a bivariate time series where each series is I(1) without drift.
Explain why, if $y_t$ and $x_t$ are cointegrated, $y_t$ and $x_{t-1}$ are also cointegrated.


For the sake of illustration, take $\beta = 1$, suppose that $y_0 = x_0 = 0$, and write $y_t = y_{t-1} + r_t$,
$x_t = x_{t-1} + v_t$, where $\{r_t\}$ and $\{v_t\}$ are two I(0) processes with zero means. Then, $y_t$ and $x_t$ have a
tendency to wander around and not return to the initial value of zero with any regularity. By contrast,
if $y_t - x_t$ is I(0), it has zero mean and does return to zero with some regularity.

As a specific example, let $r6_t$ be the annualized interest rate for six-month T-bills (at the end of
quarter $t$) and let $r3_t$ be the annualized interest rate for three-month T-bills. (These are typically called
bond equivalent yields, and they are reported in the financial pages.) In Example 18.2, using the data
in INTQRT, we found little evidence against the hypothesis that $r3_t$ has a unit root; the same is true
of $r6_t$. Define the spread between six- and three-month T-bill rates as $spr_t = r6_t - r3_t$. Then, using
equation (18.21), the Dickey-Fuller $t$ statistic for $spr_t$ is $-7.71$ (with $\hat{\theta} = -.67$ or $\hat{\rho} = .33$). Therefore,
we strongly reject a unit root for $spr_t$ in favor of I(0). The upshot of this is that though $r6_t$ and $r3_t$ each
appear to be unit root processes, the difference between them is an I(0) process. In other words, $r6$
and $r3$ are cointegrated.

Cointegration in this example, as in many examples, has an economic interpretation. If r6 and 
r3 were not cointegrated, the difference between interest rates could become very large, with no ten- 
dency for them to come back together. Based on a simple arbitrage argument, this seems unlikely. 
Suppose that the spread $spr_t$ continues to grow for several time periods, making six-month T-bills
a much more desirable investment. Then, investors would shift away from three-month and toward 
six-month T-bills, driving up the price of six-month T-bills, while lowering the price of three-month 
T-bills. Because interest rates are inversely related to price, this would lower r6 and increase r3, until 
the spread is reduced. Therefore, large deviations between r6 and r3 are not expected to continue: the 
spread has a tendency to return to its mean value. (The spread actually has a slightly positive mean 
because long-term investors are more rewarded relative to short-term investors.) 

There is another way to characterize the fact that $spr_t$ will not deviate for long periods from
its average value: $r6$ and $r3$ have a long-run relationship. To describe what we mean by this, let
$\mu = E(spr_t)$ denote the expected value of the spread. Then, we can write


$r6_t = r3_t + \mu + e_t$,


where $\{e_t\}$ is a zero mean, I(0) process. The equilibrium or long-run relationship occurs when $e_t = 0$,
or $r6^* = r3^* + \mu$. At any time period, there can be deviations from equilibrium, but they will be temporary:
there are economic forces that drive $r6$ and $r3$ back toward the equilibrium relationship.

In the interest rate example, we used economic reasoning to tell us the value of $\beta$ if $y_t$ and $x_t$ are
cointegrated. If we have a hypothesized value of $\beta$, then testing whether two series are cointegrated is
easy: we simply define a new variable, $s_t = y_t - \beta x_t$, and apply either the usual DF or augmented DF
test to $\{s_t\}$. If we reject a unit root in $\{s_t\}$ in favor of the I(0) alternative, then we find that $y_t$ and $x_t$ are
cointegrated. In other words, the null hypothesis is that $y_t$ and $x_t$ are not cointegrated.
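
When $\beta$ is hypothesized rather than estimated, the procedure is just a Dickey-Fuller test applied to $s_t$. The sketch below uses simulated series as a stand-in for data such as $r6_t$ and $r3_t$ (the INTQRT data are not loaded here); the data-generating values and the function name `df_t_stat` are assumptions for illustration.

```python
import numpy as np

def df_t_stat(s):
    """Dickey-Fuller t statistic from regressing ds_t on a constant and s_{t-1}."""
    s = np.asarray(s, dtype=float)
    ds = np.diff(s)
    X = np.column_stack([np.ones_like(ds), s[:-1]])
    coef = np.linalg.lstsq(X, ds, rcond=None)[0]
    resid = ds - X @ coef
    sigma2 = resid @ resid / (len(ds) - 2)
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
    return coef[1] / se[1]

# Simulated cointegrated pair: x is a random walk and y tracks x up to
# a constant mean and a stable AR(1) deviation.
rng = np.random.default_rng(11)
n = 200
x = np.cumsum(rng.standard_normal(n))
shocks = rng.standard_normal(n)
dev = np.zeros(n)
for t in range(1, n):
    dev[t] = 0.3 * dev[t - 1] + shocks[t]
y = 0.2 + x + dev

beta = 1.0                 # hypothesized cointegration parameter
s = y - beta * x           # the "spread"
t_spread = df_t_stat(s)    # compare with the usual DF critical values
t_level = df_t_stat(y)     # for contrast: y itself is I(1)
print(f"DF t statistic for s_t: {t_spread:.2f}")
print(f"DF t statistic for y_t: {t_level:.2f}")
```

Because $\beta$ is fixed in advance, the ordinary Dickey-Fuller critical values apply to `t_spread`; no adjustment for estimating $\beta$ is needed.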

Testing for cointegration is more difficult when the (potential) cointegration parameter $\beta$ is
unknown. Rather than test for a unit root in $\{s_t\}$, we must first estimate $\beta$. If $y_t$ and $x_t$ are cointegrated,
it turns out that the OLS estimator $\hat{\beta}$ from the regression


$y_t = \hat{\alpha} + \hat{\beta} x_t$ [18.31]


is consistent for $\beta$. The problem is that the null hypothesis states that the two series are not cointegrated,
which means that, under $H_0$, we are running a spurious regression. Fortunately, it is possible
to tabulate critical values even when $\beta$ is estimated, where we apply the Dickey-Fuller or augmented
Dickey-Fuller test to the residuals, say, $\hat{u}_t = y_t - \hat{\alpha} - \hat{\beta} x_t$, from (18.31). The only difference is that
the critical values account for estimation of $\beta$. The resulting test is called the Engle-Granger test, and
the asymptotic critical values are given in Table 18.4. These are taken from Davidson and MacKinnon
(1993, Table 20.2).


TABLE 18.4 Asymptotic Critical Values for Cointegration Test: No Time Trend
Significance level    1%      2.5%    5%      10%
Critical value        -3.90   -3.59   -3.34   -3.04


TABLE 18.5 Asymptotic Critical Values for Cointegration Test: Linear Time Trend
Significance level    1%      2.5%    5%      10%
Critical value        -4.32   -4.03   -3.78   -3.50


In the basic test, we run the regression of $\Delta\hat{u}_t$ on $\hat{u}_{t-1}$ and compare the $t$ statistic on $\hat{u}_{t-1}$ to the
desired critical value in Table 18.4. If the $t$ statistic is below the critical value, we have evidence that
$y_t - \beta x_t$ is I(0) for some $\beta$; that is, $y_t$ and $x_t$ are cointegrated. We can add lags of $\Delta\hat{u}_t$ to account for
serial correlation. If we compare the critical values in Table 18.4 with those in Table 18.2, we must get
a $t$ statistic much larger in magnitude to find cointegration than if we used the usual DF critical values.
This happens because OLS, which minimizes the sum of squared residuals, tends to produce residuals
that look like an I(0) sequence even if $y_t$ and $x_t$ are not cointegrated.

As with the usual Dickey-Fuller test, we can augment the Engle-Granger test by including lags of
$\Delta\hat{u}_t$ as additional regressors.
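
The two steps of the basic Engle-Granger test can be sketched as follows. This is a minimal illustration on simulated data, with no lags of $\Delta\hat{u}_t$; the function names, the simulated parameter values, and the 5% critical value of $-3.34$ (the standard asymptotic value for the no-trend case) are assumptions stated here rather than results from the text's data sets.

```python
import numpy as np

def ols(X, y):
    """OLS coefficients, usual standard errors, and residuals."""
    coef = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ coef
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
    return coef, se, resid

def engle_granger_t(y, x):
    """Return (beta_hat, t_stat) for the basic Engle-Granger test."""
    # Step 1: levels regression of y on a constant and x; keep the residuals.
    X = np.column_stack([np.ones_like(x), x])
    coef, _, u_hat = ols(X, y)
    # Step 2: regress the change in the residuals on the lagged residual.
    du = np.diff(u_hat)
    coef2, se2, _ = ols(u_hat[:-1].reshape(-1, 1), du)
    return coef[1], coef2[0] / se2[0]

# Simulated cointegrated pair with beta = 2 (illustrative values).
rng = np.random.default_rng(3)
n = 300
x = np.cumsum(rng.standard_normal(n))
shocks = rng.standard_normal(n)
u = np.zeros(n)
for t in range(1, n):
    u[t] = 0.5 * u[t - 1] + shocks[t]
y = 1.0 + 2.0 * x + u

EG_CRIT_5PCT = -3.34   # assumed asymptotic 5% value, no time trend
beta_hat, eg_t = engle_granger_t(y, x)
print(f"beta_hat = {beta_hat:.3f}, Engle-Granger t = {eg_t:.2f}")
print("reject 'no cointegration' at 5%:", eg_t < EG_CRIT_5PCT)
```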

If $y_t$ and $x_t$ are not cointegrated, a regression of $y_t$ on $x_t$ is spurious and tells us nothing meaningful:
there is no long-run relationship between $y$ and $x$. We can still run a regression involving the first
differences, $\Delta y_t$ and $\Delta x_t$, including lags. But we should interpret these regressions for what they are:
they explain the difference in $y$ in terms of the difference in $x$ and have nothing necessarily to do with
a relationship in levels.

If $y_t$ and $x_t$ are cointegrated, we can use this to specify more general dynamic models, as we will
see in the next subsection.

The previous discussion assumes that neither $y_t$ nor $x_t$ has a drift. This is reasonable for interest
rates but not for other time series. If $y_t$ and $x_t$ contain drift terms, $E(y_t)$ and $E(x_t)$ are linear (usually
increasing) functions of time. The strict definition of cointegration requires $y_t - \beta x_t$ to be I(0) without
a trend. To see what this entails, write $y_t = \delta t + g_t$ and $x_t = \lambda t + h_t$, where $\{g_t\}$ and $\{h_t\}$ are I(1)
processes, $\delta$ is the drift in $y_t$ [$\delta = E(\Delta y_t)$], and $\lambda$ is the drift in $x_t$ [$\lambda = E(\Delta x_t)$]. Now, if $y_t$ and $x_t$ are
cointegrated, there must exist $\beta$ such that $g_t - \beta h_t$ is I(0). But then


$y_t - \beta x_t = (\delta - \beta\lambda) t + (g_t - \beta h_t)$,


which is generally a trend-stationary process. The strict form of cointegration requires that there not
be a trend, which means $\delta = \beta\lambda$. For I(1) processes with drift, it is possible that the stochastic parts
(that is, $g_t$ and $h_t$) are cointegrated, but that the parameter $\beta$ that causes $g_t - \beta h_t$ to be I(0) does not
eliminate the linear time trend.

We can test for cointegration between $g_t$ and $h_t$ without taking a stand on the trend part, by running
the regression


$\hat{y}_t = \hat{\alpha} + \hat{\eta} t + \hat{\beta} x_t$ [18.32]


and applying the usual DF or augmented DF test to the residuals $\hat{u}_t$. The asymptotic critical values are
given in Table 18.5 [from Davidson and MacKinnon (1993, Table 20.2)].

A finding of cointegration in this case leaves open the possibility that $y_t - \beta x_t$ has a linear trend.
But at least it is not I(1).


EXAMPLE 18.5 Cointegration between Fertility and Personal Exemption 


In Chapters 10 and 11, we studied various models to estimate the relationship between the general
fertility rate ($gfr$) and the real value of the personal tax exemption ($pe$) in the United States. The static
regression results in levels and first differences are notably different. The regression in levels, with a
time trend included, gives an OLS coefficient on $pe$ equal to .187 (se = .035) and $R^2 = .500$. In first
differences (without a trend), the coefficient on $\Delta pe$ is $-.043$ (se = .028) and $R^2 = .032$. Although
there are other reasons for these differences, such as misspecified distributed lag dynamics, the
discrepancy between the levels and changes regressions suggests that we should test for cointegration.
Of course, this presumes that $gfr$ and $pe$ are I(1) processes. This appears to be the case: the augmented
DF tests, with a single lagged change and a linear time trend, each yield $t$ statistics of about $-1.47$,
and the estimated AR(1) coefficients are close to one.

When we obtain the residuals from the regression of $gfr$ on $t$ and $pe$ and apply the augmented
DF test with one lag, we obtain a $t$ statistic on $\hat{u}_{t-1}$ of $-2.43$, which is nowhere near the 10% critical
value, $-3.50$. Therefore, we must conclude that there is little evidence of cointegration between
$gfr$ and $pe$, even allowing for separate trends. It is very likely that the earlier regression results we
obtained in levels suffer from the spurious regression problem.

The good news is that, when we used first differences and allowed for two lags (see
equation (11.27)), we found an overall positive and significant long-run effect of $\Delta pe$ on $\Delta gfr$.


If we think two series are cointegrated, we often want to test hypotheses about the cointegrating
parameter. For example, a theory may state that the cointegrating parameter is one. Ideally, we could
use a $t$ statistic to test this hypothesis.

We explicitly cover the case without time trends, although the extension to the linear trend case is
immediate. When $y_t$ and $x_t$ are I(1) and cointegrated, we can write


$y_t = \alpha + \beta x_t + u_t$, [18.33]


where $u_t$ is a zero mean, I(0) process. Generally, $\{u_t\}$ contains serial correlation, but we know from
Chapter 11 that this does not affect consistency of OLS. As mentioned earlier, OLS applied to (18.33)
consistently estimates $\beta$ (and $\alpha$). Unfortunately, because $x_t$ is I(1), the usual inference procedures do
not necessarily apply: OLS is not asymptotically normally distributed, and the $t$ statistic for $\hat{\beta}$ does
not necessarily have an approximate $t$ distribution. We do know from Chapter 10 that, if $\{x_t\}$ is strictly
exogenous (see Assumption TS.3) and the errors are homoskedastic, serially uncorrelated, and normally
distributed, the OLS estimator is also normally distributed (conditional on the explanatory variables)
and the $t$ statistic has an exact $t$ distribution. Unfortunately, these assumptions are too strong to
apply to most situations. The notion of cointegration implies nothing about the relationship between
$\{x_t\}$ and $\{u_t\}$; indeed, they can be arbitrarily correlated. Further, except for requiring that $\{u_t\}$ is I(0),
cointegration between $y_t$ and $x_t$ does not restrict the serial dependence in $\{u_t\}$.

Fortunately, the feature of (18.33) that makes inference the most difficult, the lack of strict
exogeneity of $\{x_t\}$, can be fixed. Because $x_t$ is I(1), the proper notion of strict exogeneity is that $u_t$
is uncorrelated with $\Delta x_s$, for all $t$ and $s$. We can always arrange this for a new set of errors, at least
approximately, by writing $u_t$ as a function of the $\Delta x_s$ for all $s$ close to $t$. For example,


$u_t = \eta + \phi_0 \Delta x_t + \phi_1 \Delta x_{t-1} + \phi_2 \Delta x_{t-2} + \gamma_1 \Delta x_{t+1} + \gamma_2 \Delta x_{t+2} + e_t$, [18.34]


where, by construction, $e_t$ is uncorrelated with each $\Delta x_s$ appearing in the equation. The hope is that $e_t$
is uncorrelated with further lags and leads of $\Delta x_s$. We know that, as $|s - t|$ gets large, the correlation
between $e_t$ and $\Delta x_s$ approaches zero, because these are I(0) processes. Now, if we plug (18.34) into
(18.33), we obtain


$y_t = \alpha_0 + \beta x_t + \phi_0 \Delta x_t + \phi_1 \Delta x_{t-1} + \phi_2 \Delta x_{t-2} + \gamma_1 \Delta x_{t+1} + \gamma_2 \Delta x_{t+2} + e_t$. [18.35]


This equation looks a bit strange because future $\Delta x_s$ appear with both current and lagged $\Delta x_t$. The
key is that the coefficient on $x_t$ is still $\beta$, and, by construction, $x_t$ is now strictly exogenous in this
equation. The strict exogeneity assumption is the important condition needed to obtain an approximately
normal $t$ statistic for $\hat{\beta}$. If $u_t$ is uncorrelated with all $\Delta x_s$, $s \neq t$, then we can drop the leads
and lags of the changes and simply include the contemporaneous change, $\Delta x_t$. Then, the equation
we estimate looks more standard but still includes the first difference of $x_t$ along with its level:
$y_t = \alpha_0 + \beta x_t + \phi_0 \Delta x_t + e_t$. In effect, adding $\Delta x_t$ solves any contemporaneous endogeneity between
$x_t$ and $u_t$. (Remember, any endogeneity does not cause inconsistency. But we are trying to obtain an
asymptotically normal $t$ statistic.) Whether we need to include leads and lags of the changes, and how
many, is really an empirical issue. Each time we add an additional lead or lag, we lose one observation,
and this can be costly unless we have a large data set.

The OLS estimator of $\beta$ from (18.35) is called the leads and lags estimator of $\beta$ because of
the way it employs $\Delta x$. [See, for example, Stock and Watson (1993).] The only issue we must worry
about in (18.35) is the possibility of serial correlation in $\{e_t\}$. This can be dealt with by computing
a serial correlation-robust standard error for $\hat{\beta}$ (as described in Section 12-5) or by using a standard
AR(1) correction (such as Cochrane-Orcutt).
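
Constructing the regressors in (18.35) mostly involves careful indexing of the leads and lags. The sketch below is one way to do it with NumPy on simulated data; the function name `leads_lags_estimate` and the data-generating values are hypothetical. It reports the usual OLS standard error, which is appropriate only when $\{e_t\}$ shows little serial correlation; otherwise one of the corrections mentioned above is needed.

```python
import numpy as np

def leads_lags_estimate(y, x, leads=2, lags=2):
    """OLS of y_t on a constant, x_t, and dx_{t+leads}, ..., dx_t, ..., dx_{t-lags}.

    Returns (beta_hat, se_beta), where beta is the coefficient on x_t and
    se_beta is the usual (non-robust) OLS standard error.
    """
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    n = len(y)
    dx = np.full(n, np.nan)
    dx[1:] = np.diff(x)                      # dx[t] = x[t] - x[t-1]
    t_idx = np.arange(lags + 1, n - leads)   # periods with every lead and lag available
    cols = [np.ones(len(t_idx)), x[t_idx]]
    for k in range(-leads, lags + 1):        # k < 0: leads, k = 0: current, k > 0: lags
        cols.append(dx[t_idx - k])
    X = np.column_stack(cols)
    coef = np.linalg.lstsq(X, y[t_idx], rcond=None)[0]
    resid = y[t_idx] - X @ coef
    sigma2 = resid @ resid / (len(t_idx) - X.shape[1])
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
    return coef[1], se[1]

# Simulated cointegrated pair in which u_t is correlated with dx_t,
# so x_t is not strictly exogenous in the static equation.
rng = np.random.default_rng(5)
n = 400
v = rng.standard_normal(n)
x = np.cumsum(v)
u = 0.8 * v + rng.standard_normal(n)
y = 1.0 + 1.5 * x + u

beta_hat, se_hat = leads_lags_estimate(y, x, leads=2, lags=2)
t_stat = (beta_hat - 1.5) / se_hat           # t statistic for H0: beta = 1.5
print(f"beta_hat = {beta_hat:.4f}, se = {se_hat:.4f}, t = {t_stat:.2f}")
```

With two leads and two lags, five observations are lost relative to the static regression: three at the start of the sample and two at the end.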


EXAMPLE 18.6 Cointegrating Parameter for Interest Rates 


Earlier, we tested for cointegration between $r6$ and $r3$ (six- and three-month T-bill rates) by assuming
that the cointegrating parameter was equal to one. This led us to find cointegration and, naturally,
to conclude that the cointegrating parameter is equal to unity. Nevertheless, let us estimate the cointegrating
parameter directly and test $H_0$: $\beta = 1$. We apply the leads and lags estimator with two leads and
two lags of $\Delta r3$, as well as the contemporaneous change. The estimate of $\beta$ is $\hat{\beta} = 1.038$, and the usual
OLS standard error is .0081. Therefore, the $t$ statistic for $H_0$: $\beta = 1$ is $(1.038 - 1)/.0081 \approx 4.69$,
which is a strong statistical rejection of $H_0$. (Of course, whether 1.038 is economically different from
1 is a relevant consideration.) There is little evidence of serial correlation in the residuals, so we can
use this $t$ statistic as having an approximate normal distribution. [For comparison, the OLS estimate
of $\beta$ without the leads, lags, or contemporaneous $\Delta r3$ terms, and using five more observations, is
1.026 (se = .0077). But the $t$ statistic from (18.33) is not necessarily valid.]


There are many other estimators of cointegrating parameters, and this continues to be a very 
active area of research. The notion of cointegration applies to more than two processes, but the inter- 
pretation, testing, and estimation are much more complicated. One issue is that, even after we nor- 
malize a coefficient to be one, there can be many cointegrating relationships. BDGH provide some 
discussion and several references. 


18-4b Error Correction Models 


In addition to learning about a potential long-run relationship between two series, the concept of
cointegration enriches the kinds of dynamic models at our disposal. If $y_t$ and $x_t$ are I(1) processes and are
not cointegrated, we might estimate a dynamic model in first differences. As an example, consider the
equation


$\Delta y_t = \alpha_0 + \alpha_1 \Delta y_{t-1} + \gamma_0 \Delta x_t + \gamma_1 \Delta x_{t-1} + u_t$, [18.36]


where $u_t$ has zero mean given $\Delta x_t$, $\Delta y_{t-1}$, $\Delta x_{t-1}$, and further lags. This is essentially equation (18.16),
but in first differences rather than in levels. If we view this as a rational distributed lag model, we can
find the impact propensity, long-run propensity, and lag distribution for $\Delta y$ as a distributed lag in $\Delta x$.

If $y_t$ and $x_t$ are cointegrated with parameter $\beta$, then we have additional I(0) variables that we can
include in (18.36). Let $s_t = y_t - \beta x_t$, so that $s_t$ is I(0), and assume for the sake of simplicity that $s_t$ has
zero mean. Now, we can include lags of $s_t$ in the equation. In the simplest case, we include one lag of $s_t$:


$\Delta y_t = \alpha_0 + \alpha_1 \Delta y_{t-1} + \gamma_0 \Delta x_t + \gamma_1 \Delta x_{t-1} + \delta s_{t-1} + u_t$
$\phantom{\Delta y_t} = \alpha_0 + \alpha_1 \Delta y_{t-1} + \gamma_0 \Delta x_t + \gamma_1 \Delta x_{t-1} + \delta (y_{t-1} - \beta x_{t-1}) + u_t$, [18.37]


where $E(u_t | I_{t-1}) = 0$, and $I_{t-1}$ contains information on $\Delta x_t$ and all past values of $x$ and $y$. The term
$\delta(y_{t-1} - \beta x_{t-1})$ is called the error correction term, and (18.37) is an example of an error correction
model. (In some error correction models, the contemporaneous change in $x$, $\Delta x_t$, is omitted.
Whether it is included or not depends partly on the purpose of the equation. In forecasting, $\Delta x_t$ is
rarely included, for reasons we will see in Section 18-5.)

An error correction model allows us to study the short-run dynamics in the relationship between
$y$ and $x$. For simplicity, consider the model without lags of $\Delta y_t$ and $\Delta x_t$:


$\Delta y_t = \alpha_0 + \gamma_0 \Delta x_t + \delta (y_{t-1} - \beta x_{t-1}) + u_t$, [18.38]


where $\delta < 0$. If $y_{t-1} > \beta x_{t-1}$, then $y$ in the previous period has overshot the equilibrium; because
$\delta < 0$, the error correction term works to push $y$ back toward the equilibrium. Similarly, if
$y_{t-1} < \beta x_{t-1}$, the error correction term induces a positive change in $y$ back toward the equilibrium.

How do we estimate the parameters of an error correction model? If we know $\beta$, this is easy. For
example, in (18.38), we simply regress $\Delta y_t$ on $\Delta x_t$ and $s_{t-1}$, where $s_{t-1} = (y_{t-1} - \beta x_{t-1})$.


EXAMPLE 18.7 Error Correction Model for Holding Yields 


In Problem 6 in Chapter 11, we regressed $hy6_t$, the three-month holding yield (in percent) from buying
a six-month T-bill at time $t - 1$ and selling it at time $t$ as a three-month T-bill, on $hy3_{t-1}$, the
three-month holding yield from buying a three-month T-bill at time $t - 1$. The expectations hypothesis
implies that the slope coefficient should not be statistically different from one. It turns out that there
is evidence of a unit root in $\{hy3_t\}$, which calls into question the standard regression analysis. We will
assume that both holding yields are I(1) processes. The expectations hypothesis implies, at a minimum,
that $hy6_t$ and $hy3_{t-1}$ are cointegrated with $\beta$ equal to one, which appears to be the case (see Computer
Exercise C5). Under this assumption, an error correction model is

$$\Delta hy6_t = \alpha_0 + \gamma_0 \Delta hy3_{t-1} + \delta(hy6_{t-1} - hy3_{t-2}) + u_t,$$

where $u_t$ has zero mean, given all hy3 and hy6 dated at time $t - 1$ and earlier. The lags on the variables
in the error correction model are dictated by the expectations hypothesis.

GOING FURTHER 18.4
How would you test $H_0$: $\gamma_0 = 1$, $\delta = -1$ in the holding yield error correction model?

Using the data in INTQRT gives

$$\widehat{\Delta hy6}_t = .090 + 1.218\,\Delta hy3_{t-1} - .840(hy6_{t-1} - hy3_{t-2})$$
(.043) (.264) (.244) [18.39]
$n = 122$, $R^2 = .790$.

The error correction coefficient is negative and very significant. For example, if the holding yield on
six-month T-bills is above that for three-month T-bills by one point, hy6 falls by .84 points on average
in the next quarter. Interestingly, $\hat\delta = -.84$ is not statistically different from $-1$, as is easily seen by
computing the 95% confidence interval.
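As a quick check of the last claim, the interval can be worked out from the estimate and standard error reported in (18.39). The sketch below assumes the large-sample critical value 1.96; the exact t critical value for this sample size would differ slightly.

$$\hat\delta \pm 1.96\cdot se(\hat\delta) = -.840 \pm 1.96(.244) \approx -.840 \pm .478, \quad \text{or about } [-1.318,\ -.362].$$

Because $-1$ lies inside this interval, we cannot reject $\delta = -1$ against a two-sided alternative at roughly the 5% level.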


In many other examples, the cointegrating parameter must be estimated. Then, we replace $s_{t-1}$
with $\hat s_{t-1} = y_{t-1} - \hat\beta x_{t-1}$, where $\hat\beta$ can be various estimators of $\beta$. We have covered the standard OLS
estimator as well as the leads and lags estimator. This raises the issue about how sampling variation
in $\hat\beta$ affects inference on the other parameters in the error correction model. Fortunately, as shown
by Engle and Granger (1987), we can ignore the preliminary estimation of $\beta$ (asymptotically). This
property is very convenient and implies that the asymptotic efficiency of the estimators of the
parameters in the error correction model is unaffected by whether we use the OLS estimator or the leads and
lags estimator for $\hat\beta$. Of course, the choice of $\hat\beta$ will generally have an effect on the estimated error
correction parameters in any particular sample, but we have no systematic way of deciding which
preliminary estimator of $\beta$ to use. The procedure of replacing $\beta$ with $\hat\beta$ is called the Engle-Granger
two-step procedure.
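The two steps can be sketched in a few lines of Python. This is only an illustration on simulated data, not the text's own computation: the data-generating values (`beta`, `gamma0`, `delta`), the sample size, and the helper `ols` are all assumptions made for the example, and the first step uses the simple OLS estimator of the cointegrating parameter.

```python
import numpy as np

def ols(y, X):
    """Return OLS coefficients from regressing y on the columns of X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# Simulated data (hypothetical values): x is a random walk and y follows
# the error correction model (18.38) with alpha_0 = 0.
rng = np.random.default_rng(0)
n, beta, gamma0, delta = 2000, 2.0, 1.0, -0.5
x = np.cumsum(rng.normal(size=n))
y = np.empty(n)
y[0] = beta * x[0]
for t in range(1, n):
    dx_t = x[t] - x[t - 1]
    y[t] = y[t - 1] + gamma0 * dx_t + delta * (y[t - 1] - beta * x[t - 1]) + rng.normal()

# Step 1: estimate the cointegrating parameter by OLS of y on x (with intercept).
_, beta_hat = ols(y, np.column_stack([np.ones(n), x]))
s_hat = y - beta_hat * x  # estimated equilibrium error

# Step 2: regress the change in y on the change in x and the lagged s_hat.
dy, dx = np.diff(y), np.diff(x)
X2 = np.column_stack([np.ones(n - 1), dx, s_hat[:-1]])
alpha0_hat, gamma0_hat, delta_hat = ols(dy, X2)

print(f"beta_hat = {beta_hat:.3f}, gamma0_hat = {gamma0_hat:.3f}, delta_hat = {delta_hat:.3f}")
```

In this simulated setting the second-step estimate of the error correction coefficient should be negative and close to the value used to generate the data, even though $\beta$ was estimated in a first step.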


18-5 Forecasting 


Forecasting economic time series is very important in some branches of economics, and it is an 
area that continues to be actively studied. In this section, we focus on regression-based forecasting 
methods. Diebold (2001) provides a comprehensive introduction to forecasting, including recent 
developments. 

We assume in this section that the primary focus is on forecasting future values of a time series 
process and not necessarily on estimating causal or structural economic models. 

It is useful to first cover some fundamentals of forecasting that do not depend on a specific
model. Suppose that at time t we want to forecast the outcome of y at time t + 1, or $y_{t+1}$. The time
period could correspond to a year, a quarter, a month, a week, or even a day. Let $I_t$ denote information
that we can observe at time t. This information set includes $y_t$, earlier values of y, and often other
variables dated at time t or earlier. We can combine this information in innumerable ways to forecast
$y_{t+1}$. Is there one best way?

The answer is yes, provided we specify the loss associated with forecast error. Let $f_t$ denote
the forecast of $y_{t+1}$ made at time t. We call $f_t$ a one-step-ahead forecast. The forecast error is
$e_{t+1} = y_{t+1} - f_t$, which we observe once the outcome on $y_{t+1}$ is observed. The most common
measure of loss is the same one that leads to ordinary least squares estimation of a multiple linear
regression model: the squared error, $e_{t+1}^2$. The squared forecast error treats positive and negative prediction
errors symmetrically, and larger forecast errors receive relatively more weight. For example, errors of
+2 and -2 yield the same loss, and the loss is four times as great as forecast errors of +1 or -1. The
squared forecast error is an example of a loss function. Another popular loss function is the absolute
value of the prediction error, $|e_{t+1}|$. For reasons to be seen shortly, we focus now on squared error loss.

Given the squared error loss function, we can determine how to best use the information at time
t to forecast $y_{t+1}$. But we must recognize that at time t, we do not know $e_{t+1}$: it is a random variable,
because $y_{t+1}$ is a random variable. Therefore, any useful criterion for choosing $f_t$ must be based on
what we know at time t. It is natural to choose the forecast to minimize the expected squared forecast
error, given $I_t$:

$$E(e_{t+1}^2|I_t) = E[(y_{t+1} - f_t)^2|I_t].$$ [18.40]

A basic fact from probability (see Property CE.6 in Math Refresher B) is that the conditional
expectation, $E(y_{t+1}|I_t)$, minimizes (18.40). In other words, if we wish to minimize the expected squared
forecast error given information at time t, our forecast should be the expected value of $y_{t+1}$ given
variables we know at time t.

For many popular time series processes, the conditional expectation is easy to obtain. Suppose
that $\{y_t: t = 0, 1, \ldots\}$ is a martingale difference sequence (MDS) and take $I_t$ to be $\{y_t, y_{t-1}, \ldots, y_0\}$,
the observed past of y. By definition, $E(y_{t+1}|I_t) = 0$ for all t; the best prediction of $y_{t+1}$ at time t is
always zero! Recall from Section 18-2 that an i.i.d. sequence with zero mean is a martingale
difference sequence.

A martingale difference sequence is one in which the past is not useful for predicting the future.
Stock returns are widely thought to be well approximated as an MDS or, perhaps, with a positive
mean. The key is that $E(y_{t+1}|y_t, y_{t-1}, \ldots) = E(y_{t+1})$: the conditional mean is equal to the
unconditional mean, in which case past outcomes on y do not help to predict future y.




A process $\{y_t\}$ is a martingale if $E(y_{t+1}|y_t, y_{t-1}, \ldots, y_0) = y_t$ for all $t \geq 0$. [If $\{y_t\}$ is a
martingale, then $\{\Delta y_t\}$ is a martingale difference sequence, which is where the latter name comes from.] The
predicted value of y for the next period is always the value of y for this period.

A more complicated example is

$$E(y_{t+1}|I_t) = \alpha y_t + \alpha(1 - \alpha)y_{t-1} + \cdots + \alpha(1 - \alpha)^t y_0,$$ [18.41]

where $0 < \alpha < 1$ is a parameter that we must choose. This method of forecasting is called
exponential smoothing because the weights on the lagged y decline to zero exponentially.

The reason for writing the expectation as in (18.41) is that it leads to a very simple recurrence
relation. Set $f_0 = y_0$. Then, for $t \geq 1$, the forecasts can be obtained as

$$f_t = \alpha y_t + (1 - \alpha)f_{t-1}.$$

In other words, the forecast of $y_{t+1}$ is a weighted average of $y_t$ and the forecast of $y_t$ made at time
$t - 1$. Exponential smoothing is suitable only for very specific time series and requires choosing $\alpha$.
Regression methods, which we turn to next, are more flexible.
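The recurrence is easy to code. The function below is a minimal sketch; the name `exponential_smoothing`, the argument `a` (standing in for $\alpha$), and the three-point series are hypothetical choices for illustration.

```python
def exponential_smoothing(y, a):
    """One-step-ahead forecasts f_0, f_1, ..., f_{n-1} from the recurrence
    f_t = a*y_t + (1 - a)*f_{t-1}, started at f_0 = y_0.
    forecasts[t] is the forecast of y_{t+1} made at time t."""
    if not 0 < a < 1:
        raise ValueError("the smoothing parameter must lie strictly between 0 and 1")
    forecasts = [y[0]]
    for t in range(1, len(y)):
        forecasts.append(a * y[t] + (1 - a) * forecasts[-1])
    return forecasts

# Illustrative (made-up) series.
f = exponential_smoothing([1.0, 2.0, 3.0], a=0.5)
print(f)  # f[2] = 0.5*3 + 0.25*2 + 0.25*1
```

Unrolling the recurrence shows the exponentially declining weights $a, a(1-a), a(1-a)^2, \ldots$ on $y_t, y_{t-1}, y_{t-2}, \ldots$; with the starting value $f_0 = y_0$, the weight that lands on the initial observation $y_0$ is $(1-a)^t$.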

The previous discussion has focused on forecasting y only one period ahead. The general issues
that arise in forecasting $y_{t+h}$ at time t, where h is any positive integer, are similar. In particular, if we
use expected squared forecast error as our measure of loss, the best predictor is $E(y_{t+h}|I_t)$. When
dealing with a multiple-step-ahead forecast, we use the notation $f_{t,h}$ to indicate the forecast of $y_{t+h}$ made
at time t.


18-5a Types of Regression Models Used for Forecasting 


There are many different regression models that we can use to forecast future values of a time series. 
The first regression model for time series data from Chapter 10 was the static model. To see how we 
can forecast with this model, assume that we have a single explanatory variable: 


$$y_t = \beta_0 + \beta_1 z_t + u_t.$$ [18.42]

Suppose, for the moment, that the parameters $\beta_0$ and $\beta_1$ are known. Write this equation at time
t + 1 as $y_{t+1} = \beta_0 + \beta_1 z_{t+1} + u_{t+1}$. Now, if $z_{t+1}$ is known at time t, so that it is an element of $I_t$ and
$E(u_{t+1}|I_t) = 0$, then

$$E(y_{t+1}|I_t) = \beta_0 + \beta_1 z_{t+1},$$

where $I_t$ contains $z_{t+1}, y_t, z_t, \ldots, y_1, z_1$. The right-hand side of this equation is the forecast of $y_{t+1}$
at time t. This kind of forecast is usually called a conditional forecast because it is conditional on
knowing the value of z at time t + 1.

Unfortunately, at any time, we rarely know the value of the explanatory variables in future time
periods. Exceptions include time trends and seasonal dummy variables, which we cover explicitly
below, but otherwise knowledge of $z_{t+1}$ at time t is rare. Sometimes, we wish to generate conditional
forecasts for several values of $z_{t+1}$.

Another problem with (18.42) as a model for forecasting is that $E(u_{t+1}|I_t) = 0$ means that $\{u_t\}$
cannot contain serial correlation, something we have seen to be false in most static regression models.
[Problem 8 asks you to derive the forecast in a simple distributed lag model with AR(1) errors.]

If $z_{t+1}$ is not known at time t, we cannot include it in $I_t$. Then, we have

$$E(y_{t+1}|I_t) = \beta_0 + \beta_1 E(z_{t+1}|I_t).$$

This means that in order to forecast $y_{t+1}$, we must first forecast $z_{t+1}$, based on the same information
set. This is usually called an unconditional forecast because we do not assume knowledge of $z_{t+1}$ at
time t. Unfortunately, this is somewhat of a misnomer, as our forecast is still conditional on the
information in $I_t$. But the name is entrenched in the forecasting literature.

For forecasting, unless we are wedded to the static model in (18.42) for other reasons, it makes
more sense to specify a model that depends only on lagged values of y and z. This saves us the extra
step of having to forecast a right-hand side variable before forecasting y. The kind of model we have
in mind is

$$y_t = \delta_0 + \alpha_1 y_{t-1} + \gamma_1 z_{t-1} + u_t$$
$$E(u_t|I_{t-1}) = 0,$$ [18.43]

where $I_{t-1}$ contains y and z dated at time $t - 1$ and earlier. Now, the forecast of $y_{t+1}$ at time t is
$\delta_0 + \alpha_1 y_t + \gamma_1 z_t$; if we know the parameters, we can just plug in the values of $y_t$ and $z_t$.

If we only want to use past y to predict future y, then we can drop $z_{t-1}$ from (18.43). Naturally,
we can add more lags of y or z and lags of other variables. Especially for forecasting one step ahead,
such models can be very useful.


18-5b One-Step-Ahead Forecasting 


Obtaining a forecast one period after the sample ends is relatively straightforward using models such
as (18.43). As usual, let n be the sample size. The forecast of $y_{n+1}$ is

$$\hat f_n = \hat\delta_0 + \hat\alpha_1 y_n + \hat\gamma_1 z_n,$$ [18.44]

where we assume that the parameters have been estimated by OLS. We use a hat on $f_n$ to emphasize
that we have estimated the parameters in the regression model. (If we knew the parameters, there
would be no estimation error in the forecast.) The forecast error, which we will not know until time
n + 1, is

$$\hat e_{n+1} = y_{n+1} - \hat f_n.$$ [18.45]

If we add more lags of y or z to the forecasting equation, we simply lose more observations at the
beginning of the sample.

The forecast $\hat f_n$ of $y_{n+1}$ is usually called a point forecast. We can also obtain a forecast interval.
A forecast interval is essentially the same as a prediction interval, which we studied in Section 6-4.
There we showed how, under the classical linear model assumptions, to obtain an exact 95% prediction
interval. A forecast interval is obtained in exactly the same way. If the model does not satisfy the
classical linear model assumptions (for example, if it contains lagged dependent variables, as in (18.44)),
the forecast interval is still approximately valid, provided $u_t$ given $I_{t-1}$ is normally distributed with
zero mean and constant variance. (This ensures that the OLS estimators are approximately normally
distributed with the usual OLS variances and that $u_{n+1}$ is independent of the OLS estimators with
mean zero and variance $\sigma^2$.) Let $se(\hat f_n)$ be the standard error of the forecast and let $\hat\sigma$ be the standard
error of the regression. [From Section 6-4, we can obtain $\hat f_n$ and $se(\hat f_n)$ as the intercept and its standard
error from the regression of $y_t$ on $(y_{t-1} - y_n)$ and $(z_{t-1} - z_n)$, $t = 1, 2, \ldots, n$; that is, we subtract the
time n value of y from each lagged y, and similarly for z, before doing the regression.] Then,

$$se(\hat e_{n+1}) = \{[se(\hat f_n)]^2 + \hat\sigma^2\}^{1/2},$$ [18.46]

and the (approximate) 95% forecast interval is

$$\hat f_n \pm 1.96\cdot se(\hat e_{n+1}).$$ [18.47]

Because $se(\hat f_n)$ is roughly proportional to $1/\sqrt{n}$, $se(\hat f_n)$ is usually small relative to the uncertainty in
the error $u_{n+1}$, as measured by $\hat\sigma$.
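The recentering device described in brackets can be sketched as follows. The data are simulated and every numerical value in the data-generating step is a hypothetical choice; the helper `ols` is written out by hand (it is not a library routine) so that the example needs only NumPy.

```python
import numpy as np

def ols(y, X):
    """OLS of y on X (X must already include a column of ones).
    Returns coefficients, usual OLS standard errors, and the standard error of the regression."""
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    coef = XtX_inv @ X.T @ y
    resid = y - X @ coef
    sigma_hat = np.sqrt(resid @ resid / (n - k))
    return coef, sigma_hat * np.sqrt(np.diag(XtX_inv)), sigma_hat

# Simulated data from a model of the form (18.43); parameter values are made up.
rng = np.random.default_rng(1)
T = 200
y, z = np.zeros(T), np.zeros(T)
for t in range(1, T):
    z[t] = 0.5 * z[t - 1] + rng.normal()
    y[t] = 1.0 + 0.6 * y[t - 1] + 0.3 * z[t - 1] + rng.normal()

y_dep, y_lag, z_lag = y[1:], y[:-1], z[:-1]
y_n, z_n = y[-1], z[-1]  # last observed values, used to forecast the next period

# Recentered regression: the intercept is the forecast, its standard error is se(f_hat).
X_c = np.column_stack([np.ones(T - 1), y_lag - y_n, z_lag - z_n])
coef_c, se_c, sigma_hat = ols(y_dep, X_c)
f_hat, se_f = coef_c[0], se_c[0]

se_e = np.sqrt(se_f**2 + sigma_hat**2)                    # equation (18.46)
lower, upper = f_hat - 1.96 * se_e, f_hat + 1.96 * se_e   # equation (18.47)

# Check: plugging y_n and z_n into the usual (uncentered) regression gives the same forecast.
coef, _, _ = ols(y_dep, np.column_stack([np.ones(T - 1), y_lag, z_lag]))
f_direct = coef @ np.array([1.0, y_n, z_n])
print(f"forecast = {f_hat:.3f}, interval = [{lower:.3f}, {upper:.3f}]")
```

Recentering changes only the intercept, which is why the point forecast agrees with the direct plug-in calculation while the regression output now reports its standard error.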

Many econometrics packages have built-in commands that will produce the standard error in
equation (18.46). Otherwise, it requires some simple manipulations of typical regression output. As
a check of our calculations, we can use the same regression used to obtain studentized residuals in
Section 9-5c. In particular, suppose we want to forecast $y_t$ for t = n + 1, and we use the first n
observations to estimate the parameters. It turns out that the standard error in (18.46) is obtained as the
standard error on a dummy variable indicating observation n + 1, when we use all n + 1
observations in the regression. In other words, define $dnp1_t$ (where "np1" stands for "n plus one") to be equal
to one if t = n + 1 and zero otherwise. Then run the regression

$y_t$ on $y_{t-1}$, $z_{t-1}$, $dnp1_t$, $t = 2, \ldots, n + 1$.

The coefficient on $dnp1_t$ is actually the forecast error, $\hat e_{n+1} = y_{n+1} - \hat f_n$. Even more importantly,
the standard error on the coefficient is the standard error in (18.46). This trick can be handy if one's
econometric package does not directly compute a standard error for a forecast.
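The dummy variable device can be verified numerically. The following is a sketch on simulated data with made-up parameter values; the helper `ols` is hand-written for the example, and $se(\hat f_n)$ is computed from the standard formula $\hat\sigma[x'(X'X)^{-1}x]^{1/2}$, where $x$ holds the regressor values used for the forecast.

```python
import numpy as np

def ols(y, X):
    """OLS of y on X (X must already include a column of ones).
    Returns coefficients, usual OLS standard errors, and the standard error of the regression."""
    n, k = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    coef = XtX_inv @ X.T @ y
    resid = y - X @ coef
    sigma_hat = np.sqrt(resid @ resid / (n - k))
    return coef, sigma_hat * np.sqrt(np.diag(XtX_inv)), sigma_hat

# Simulated series (hypothetical parameter values); the last usable row plays the role of n + 1.
rng = np.random.default_rng(2)
T = 201
y, z = np.zeros(T), np.zeros(T)
for t in range(1, T):
    z[t] = 0.5 * z[t - 1] + rng.normal()
    y[t] = 1.0 + 0.6 * y[t - 1] + 0.3 * z[t - 1] + rng.normal()

dep = y[1:]
X_all = np.column_stack([np.ones(T - 1), y[:-1], z[:-1]])
X_est, y_est = X_all[:-1], dep[:-1]   # estimation sample
x_new, y_new = X_all[-1], dep[-1]     # observation "n + 1"

# Direct route: forecast, forecast error, and the standard error in (18.46).
coef, _, sigma_hat = ols(y_est, X_est)
f_hat = x_new @ coef
e_hat = y_new - f_hat
se_f = sigma_hat * np.sqrt(x_new @ np.linalg.inv(X_est.T @ X_est) @ x_new)
se_e = np.sqrt(se_f**2 + sigma_hat**2)

# Dummy variable route: use all observations and add dnp1.
dnp1 = np.zeros(T - 1)
dnp1[-1] = 1.0
coef_d, se_d, _ = ols(dep, np.column_stack([X_all, dnp1]))

print(f"forecast error: {e_hat:.4f} vs dummy coefficient {coef_d[-1]:.4f}")
print(f"se from (18.46): {se_e:.4f} vs dummy standard error {se_d[-1]:.4f}")
```

The two routes agree because the dummy fits observation n + 1 perfectly, so the remaining coefficients and the residual sum of squares are the same as in the regression on the first n observations.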


EXAMPLE 18.8 Forecasting the U.S. Unemployment Rate 


We use the data in PHILLIPS, but only for the years 1948 through 2010, to forecast the U.S. civilian
unemployment rate for 2011. We use two models. The first is a simple AR(1) model for unem:

$$\widehat{unem}_t = 1.366 + .775\,unem_{t-1}$$
(.524) (.090) [18.48]
$n = 62$, $R^2 = .547$, $\hat\sigma = 1.065$.

The second model adds a one-year lag of inflation:

$$\widehat{unem}_t = 1.085 + .719\,unem_{t-1} + .158\,inf_{t-1}$$
(.482) (.083) (.043) [18.49]
$n = 62$, $R^2 = .626$, $\hat\sigma = .967$.

The lagged inflation rate is very significant in (18.49) ($t \approx 3.7$), and the adjusted R-squared from the
second equation is much higher than that from the first. Nevertheless, this does not necessarily mean
that the second equation will produce a better forecast for 2011. All we can say so far is that, using the
data up through 2010, a lag of inflation helps to explain variation in the unemployment rate.

To obtain the forecasts for 2011, we need to know unem and inf in 2010. These are 9.6 and 1.6,
respectively. Therefore, the forecast of $unem_{2011}$ from equation (18.48) is $1.366 + .775(9.6) \approx 8.81$.
The forecast from equation (18.49) is $1.085 + .719(9.6) + .158(1.6) \approx 8.24$. (Both of these forecasts
will be slightly different if you use an econometrics package to obtain them directly, as the coefficient
estimates in the equations have been rounded to three decimal places.) The actual civilian
unemployment rate for 2011 was 8.9, and so the simpler model provides a much better forecast.
But remember, this is just for one year, and it was during the global Great Recession that started
at the end of 2008.

We can easily obtain a 95% forecast interval. When we regress $unem_t$ on $(unem_{t-1} - 9.6)$ and
$(inf_{t-1} - 1.6)$, the intercept is 8.24 (the forecast we already computed) and the standard error
associated with the intercept is $se(\hat f_n) = .376$. Also, $\hat\sigma = .967$, and so $se(\hat e_{n+1}) = [(.376)^2 + (.967)^2]^{1/2} \approx
1.038$. The 95% forecast interval from (18.47) is $8.24 \pm 1.96(1.038)$, or about [6.2, 10.3]. This is a
wide interval, and the realized 2011 value, 8.9, is well within the interval. As expected, the standard
error of $u_{n+1}$, .967, is a very large fraction of $se(\hat e_{n+1})$.
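The arithmetic in this example can be reproduced directly from the rounded numbers reported above. The snippet below is only a check of those calculations; the variable names are illustrative, and because it uses the rounded coefficients it will differ slightly from what a regression package would report.

```python
import math

unem_2010, inf_2010 = 9.6, 1.6

# Point forecasts for 2011 from (18.48) and (18.49), using the rounded coefficients.
f_ar1 = 1.366 + 0.775 * unem_2010
f_var = 1.085 + 0.719 * unem_2010 + 0.158 * inf_2010

# Forecast interval for the second model, from (18.46) and (18.47).
se_f, sigma_hat = 0.376, 0.967
se_e = math.sqrt(se_f**2 + sigma_hat**2)
lower, upper = f_var - 1.96 * se_e, f_var + 1.96 * se_e

print(f"AR(1) forecast: {f_ar1:.2f}")
print(f"Forecast with lagged inflation: {f_var:.2f}")
print(f"se of forecast error: {se_e:.3f}; 95% interval: [{lower:.1f}, {upper:.1f}]")
```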


A professional forecaster must usually produce a forecast for every time period. For example,
at time n, she or he produces a forecast of $y_{n+1}$. Then, when $y_{n+1}$ and $z_{n+1}$ become available, he or
she must forecast $y_{n+2}$. Even if the forecaster has settled on model (18.43), there are two choices for
forecasting $y_{n+2}$. The first is to use $\hat\delta_0 + \hat\alpha_1 y_{n+1} + \hat\gamma_1 z_{n+1}$, where the parameters are estimated using
the first n observations. The second possibility is to reestimate the parameters using all n + 1
observations and then to use the same formula to forecast $y_{n+2}$. To forecast in subsequent time periods, we
can generally use the parameter estimates obtained from the initial n observations, or we can update
the regression parameters each time we obtain a new data point. Although the latter approach requires
more computation, the extra burden is relatively minor, and it can (although it need not) work better
because the regression coefficients adjust at least somewhat to the new data points.

As a specific example, suppose we wish to forecast the unemployment rate for 2012, using the
model with a single lag of unem and inf. The first possibility is to just plug the 2011 values of
unemployment and inflation into the right-hand side of (18.49). With $unem_{2011} = 8.9$ and $inf_{2011} = 3.2$, we
have a forecast for $unem_{2012}$ of about $1.085 + .719(8.9) + .158(3.2) \approx 7.99$, which we can round
to 8.0. The actual unemployment outcome for 2012 was 8.1, and so the forecast is very close to the
actual outcome. The second possibility is to reestimate the equation by adding the 2011 observation
and then using the new equation to forecast $unem_{2012}$. Rounded to two decimal places, the forecast
is 8.06, which when rounded to 8.1 produces the actual outcome for 2012. The forecasts are close
because adding one more year of data only slightly changes the OLS estimates from equation (18.49).

The model in equation (18.43) is one equation in what is known as a vector autoregressive
(VAR) model. We know what an autoregressive model is from Chapter 11: we model a single series,
$\{y_t\}$, in terms of its own past. In vector autoregressive models, we model several series (which, if you
are familiar with linear algebra, is where the word "vector" comes from) in terms of their own past.
If we have two series, $y_t$ and $z_t$, a vector autoregression consists of equations that look like

$$y_t = \delta_0 + \alpha_1 y_{t-1} + \gamma_1 z_{t-1} + \alpha_2 y_{t-2} + \gamma_2 z_{t-2} + \cdots$$ [18.50]

and

$$z_t = \eta_0 + \beta_1 y_{t-1} + \rho_1 z_{t-1} + \beta_2 y_{t-2} + \rho_2 z_{t-2} + \cdots,$$

where each equation contains an error that has zero expected value given past information on y and z.
In equation (18.43), and in the example estimated in (18.49), we assumed that one lag of each
variable captured all of the dynamics. (An F test for joint significance of $unem_{t-2}$ and $inf_{t-2}$ confirms that
only one lag of each is needed.)

As Example 18.8 illustrates, VAR models can be useful for forecasting. In many cases, we are
interested in forecasting only one variable, y, in which case we only need to estimate and analyze the
equation for y. Nothing prevents us from adding other lagged variables, say, $w_{t-1}, w_{t-2}, \ldots$, to
equation (18.50). Such equations are efficiently estimated by OLS, provided we have included enough lags
of all variables and the equation satisfies the homoskedasticity assumption for time series regressions.

Equations such as (18.50) allow us to test whether, after controlling for past values of y, past
values of z help to forecast $y_t$. Generally, we say that z Granger causes y if

$$E(y_t|I_{t-1}) \neq E(y_t|J_{t-1}),$$ [18.51]

where $I_{t-1}$ contains past information on y and z, and $J_{t-1}$ contains only information on past y. When
(18.51) holds, past z is useful, in addition to past y, for predicting $y_t$. The term "causes" in "Granger
causes" should be interpreted with caution. The only sense in which z "causes" y is given in (18.51).
In particular, it has nothing to say about contemporaneous causality between y and z, so it does not
allow us to determine whether $z_t$ is an exogenous or endogenous variable in an equation relating $y_t$
to $z_t$. (This is also why the notion of Granger causality does not apply in pure cross-sectional
contexts.)

Once we assume a linear model and decide how many lags of y should be included in
$E(y_t|y_{t-1}, y_{t-2}, \ldots)$, we can easily test the null hypothesis that z does not Granger cause y. To be more
specific, suppose that $E(y_t|y_{t-1}, y_{t-2}, \ldots)$ depends on only three lags:

$$y_t = \delta_0 + \alpha_1 y_{t-1} + \alpha_2 y_{t-2} + \alpha_3 y_{t-3} + u_t$$
$$E(u_t|y_{t-1}, y_{t-2}, \ldots) = 0.$$

Now, under the null hypothesis that z does not Granger cause y, any lags of z that we add to the
equation should have zero population coefficients. If we add $z_{t-1}$, then we can simply do a t test on $z_{t-1}$.
If we add two lags of z, then we can do an F test for joint significance of $z_{t-1}$ and $z_{t-2}$ in the equation

$$y_t = \delta_0 + \alpha_1 y_{t-1} + \alpha_2 y_{t-2} + \alpha_3 y_{t-3} + \gamma_1 z_{t-1} + \gamma_2 z_{t-2} + u_t.$$

(If there is heteroskedasticity, we can use a robust form of the test. There cannot be serial correlation
under $H_0$ because the model is dynamically complete.)
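The F test just described can be sketched with the usual restricted and unrestricted sums of squared residuals. The data below are simulated so that z does help predict y; all parameter values and the helper `ssr` are hypothetical, and the statistic is the ordinary (not heteroskedasticity-robust) F statistic.

```python
import numpy as np

def ssr(y, X):
    """Sum of squared OLS residuals from regressing y on the columns of X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return resid @ resid

# Simulated data in which lagged z enters the equation for y (made-up coefficients).
rng = np.random.default_rng(3)
T = 500
y, z = np.zeros(T), np.zeros(T)
for t in range(1, T):
    z[t] = 0.5 * z[t - 1] + rng.normal()
    y[t] = 0.4 * y[t - 1] + 0.5 * z[t - 1] + rng.normal()

p_y, p_z = 3, 2                      # three lags of y, two lags of z, as in the text
start = max(p_y, p_z)
dep = y[start:]
ones = np.ones(T - start)
y_lags = np.column_stack([y[start - j:T - j] for j in range(1, p_y + 1)])
z_lags = np.column_stack([z[start - j:T - j] for j in range(1, p_z + 1)])

X_r = np.column_stack([ones, y_lags])            # restricted: lags of y only
X_ur = np.column_stack([ones, y_lags, z_lags])   # unrestricted: adds lags of z
ssr_r, ssr_ur = ssr(dep, X_r), ssr(dep, X_ur)

q = p_z                                # number of exclusion restrictions
df = len(dep) - X_ur.shape[1]          # degrees of freedom in the unrestricted model
F = ((ssr_r - ssr_ur) / q) / (ssr_ur / df)
print(f"F statistic for excluding the lags of z: {F:.2f} (q = {q}, df = {df})")
```

A large F statistic relative to the critical value of the $F_{q,\,df}$ distribution leads us to reject the null that z does not Granger cause y; in this simulated design the statistic should be very large because lagged z genuinely enters the equation.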

As a practical matter, how do we decide on which lags of y and z to include? First, we start by
estimating an autoregressive model for y and performing t and F tests to determine how many lags of y
should appear. With annual data, the number of lags is typically small, say, one or two. With quarterly
or monthly data, there are usually many more lags. Once an autoregressive model for y has been chosen,
we can test for lags of z. The choice of lags of z is less important because, when z does not Granger cause
y, no set of lagged z's should be significant. With annual data, 1 or 2 lags are typically used; with
quarterly data, usually 4 or 8; and with monthly data, perhaps 6, 12, or maybe even 24, given enough data.

We have already done one example of testing for Granger causality in equation (18.49). The 
autoregressive model that best fits unemployment is an AR(1). In equation (18.49), we added a single 
lag of inflation, and it was very significant. Therefore, inflation Granger causes unemployment. 

There is an extended definition of Granger causality that is often useful. Let $\{w_t\}$ be a third series
(or, it could represent several additional series). Then, z Granger causes y conditional on w if (18.51)
holds, but now $I_{t-1}$ contains past information on y, z, and w, while $J_{t-1}$ contains past information
on y and w. It is certainly possible that z Granger causes y, but z does not Granger cause y conditional
on w. A test of the null that z does not Granger cause y conditional on w is obtained by testing for
significance of lagged z in a model for y that also depends on lagged y and lagged w. For example, to test
whether growth in the money supply Granger causes growth in real GDP, conditional on the change
in interest rates, we would regress $gGDP_t$ on lags of gGDP, $\Delta int$, and gM and do significance tests on
the lags of gM. [See, for example, Stock and Watson (1989).]


18-5c Comparing One-Step-Ahead Forecasts 


In almost any forecasting problem, there are several competing methods for forecasting. Even when 
we restrict attention to regression models, there are many possibilities. Which variables should be 
included, and with how many lags? Should we use logs, levels of variables, or first differences? 

In order to decide on a forecasting method, we need a way to choose which one is most suitable. 
Broadly, we can distinguish between in-sample criteria and out-of-sample criteria. In a regression 
context, in-sample criteria include R-squared and especially adjusted R-squared. There are many other 
model selection statistics, but we will not cover those here [see, for example, Ramanathan (1995, 
Chapter 4)]. 

For forecasting, it is better to use out-of-sample criteria, as forecasting is essentially an out-of- 
sample problem. A model might provide a good fit to y in the sample used to estimate the parameters. 
But this need not translate to good forecasting performance. An out-of-sample comparison involves 
using the first part of a sample to estimate the parameters of the model and saving the latter part of the 
sample to gauge its forecasting capabilities. This mimics what we would have to do in practice if we 
did not yet know the future values of the variables. 

Suppose that we have n + m observations, where we use the first n observations to estimate the
parameters in our model and save the last m observations for forecasting. Let $\hat f_{n+h}$ be the
one-step-ahead forecast of $y_{n+h+1}$ for $h = 0, 1, \ldots, m - 1$. The m forecast errors are $\hat e_{n+h+1} = y_{n+h+1} - \hat f_{n+h}$.
How should we measure how well our model forecasts y when it is out of sample? Two measures are
most common. The first is the root mean squared error (RMSE):

$$RMSE = \left(m^{-1}\sum_{h=0}^{m-1} \hat e_{n+h+1}^2\right)^{1/2}.$$ [18.52]

This is essentially the sample standard deviation of the forecast errors (without any degrees of
freedom adjustment). If we compute RMSE for two or more forecasting methods, then we prefer the
method with the smallest out-of-sample RMSE.

A second common measure is the mean absolute error (MAE), which is the average of the
absolute forecast errors:

$$MAE = m^{-1}\sum_{h=0}^{m-1} |\hat e_{n+h+1}|.$$ [18.53]

Again, we prefer a smaller MAE. Other possible criteria include minimizing the largest of the
absolute values of the forecast errors.
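Both criteria take only a line or two of code. In the sketch below the function names and the two sets of forecast errors are made up for illustration; they are chosen so that the two hypothetical methods have the same MAE but different RMSE, which shows how squared error loss penalizes a single large miss more heavily.

```python
import numpy as np

def rmse(errors):
    """Root mean squared error, equation (18.52): no degrees-of-freedom adjustment."""
    e = np.asarray(errors, dtype=float)
    return float(np.sqrt(np.mean(e**2)))

def mae(errors):
    """Mean absolute error, equation (18.53)."""
    e = np.asarray(errors, dtype=float)
    return float(np.mean(np.abs(e)))

# Hypothetical out-of-sample forecast errors from two methods (m = 4).
errors_a = [1.0, -1.0, 1.0, -1.0]   # consistently off by one
errors_b = [0.0, 0.0, 0.0, 4.0]     # usually exact, with one large miss

print(f"Method A: RMSE = {rmse(errors_a):.3f}, MAE = {mae(errors_a):.3f}")
print(f"Method B: RMSE = {rmse(errors_b):.3f}, MAE = {mae(errors_b):.3f}")
```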


EXAMPLE 18.9 Out-of-Sample Comparisons of Unemployment Forecasts 


In Example 18.8, using the data in PHILLIPS, we found that equation (18.49) fits notably better over 
the years 1948 through 2010 than did equation (18.48). However, for the single year 2011, the fore- 
cast from (18.48) is much closer to the actual outcome. Now, we use the two models, still estimated 
using the data only through 2010, to compare one-step-ahead forecasts for 2011 through 2017. This 
leaves seven out-of-sample observations (n = 62 and m = 7) to use in equations (18.52) and (18.53). 
For the AR(1) model for unem, RMSE = .606 and MAE = .515. For the model that adds lagged 
inflation (a VAR model of order one), RMSE = .394 and MAE = .327. Therefore, averaged over the 
seven out-of-sample time periods, the more general model fits substantially better based on both the 
RMSE and MAE—even though it does less well in the single year 2011. 


Rather than using only the first n observations to estimate the parameters of the model, we can 
reestimate the models each time we add a new observation and use the new model to forecast the next 
time period. 


18-5d Multiple-Step-Ahead Forecasts 


Forecasting more than one period ahead is generally more difficult than forecasting one period ahead.
We can formalize this as follows. Suppose we consider forecasting $y_{t+1}$ at time t and at an earlier time
period s (so that s < t). Then $Var[y_{t+1} - E(y_{t+1}|I_t)] \leq Var[y_{t+1} - E(y_{t+1}|I_s)]$, where the inequality
is usually strict. We will not prove this result generally, but, intuitively, it makes sense: the forecast
error variance in predicting $y_{t+1}$ is larger when we make that forecast based on less information.

If $\{y_t\}$ follows an AR(1) model (which includes a random walk, possibly with drift), we can
easily show that the error variance increases with the forecast horizon. The model is

$$y_t = \alpha + \rho y_{t-1} + u_t$$
$$E(u_t|I_{t-1}) = 0, \quad I_{t-1} = \{y_{t-1}, y_{t-2}, \ldots\},$$

and $\{u_t\}$ has constant variance $\sigma^2$ conditional on $I_{t-1}$. At time $t + h - 1$, our forecast of $y_{t+h}$ is
$\alpha + \rho y_{t+h-1}$, and the forecast error is simply $u_{t+h}$. Therefore, the one-step-ahead forecast variance is
simply $\sigma^2$. To find multiple-step-ahead forecasts, we have, by repeated substitution,

$$y_{t+h} = (1 + \rho + \cdots + \rho^{h-1})\alpha + \rho^h y_t + \rho^{h-1}u_{t+1} + \rho^{h-2}u_{t+2} + \cdots + u_{t+h}.$$

At time t, the expected value of $u_{t+j}$, for all $j \geq 1$, is zero. So

$$E(y_{t+h}|I_t) = (1 + \rho + \cdots + \rho^{h-1})\alpha + \rho^h y_t,$$ [18.54]

and the forecast error is $e_{t,h} = \rho^{h-1}u_{t+1} + \rho^{h-2}u_{t+2} + \cdots + u_{t+h}$. This is a sum of
uncorrelated random variables, and so the variance of the sum is the sum of the variances:
$Var(e_{t,h}) = \sigma^2[\rho^{2(h-1)} + \rho^{2(h-2)} + \cdots + \rho^2 + 1]$. Because $\rho^2 > 0$, each term multiplying $\sigma^2$ is
positive, so the forecast error variance increases with h. When $\rho^2 < 1$, as h gets large the forecast variance
converges to $\sigma^2/(1 - \rho^2)$, which is just the unconditional variance of $y_t$. In the case of a random walk
($\rho = 1$), $f_{t,h} = \alpha h + y_t$ and $Var(e_{t,h}) = \sigma^2 h$: the forecast variance grows without bound as the
horizon h increases. This demonstrates that it is very difficult to forecast a random walk, with or without
drift, far out into the future. For example, forecasts of interest rates farther into the future become
dramatically less precise.

Equation (18.54) shows that using the AR(1) model for multistep forecasting is easy, once we
have estimated $\rho$ by OLS. The forecast of $y_{n+h}$ at time n is

$$\hat f_{n,h} = (1 + \hat\rho + \cdots + \hat\rho^{h-1})\hat\alpha + \hat\rho^h y_n.$$ [18.55]

Obtaining forecast intervals is harder, unless h = 1, because obtaining the standard error of $\hat f_{n,h}$ is
difficult. Nevertheless, the standard error of $\hat f_{n,h}$ is usually small compared with the standard deviation of
the error term, and the latter can be estimated as $\hat\sigma[\hat\rho^{2(h-1)} + \hat\rho^{2(h-2)} + \cdots + \hat\rho^2 + 1]^{1/2}$, where $\hat\sigma$ is the
standard error of the regression from the AR(1) estimation. We can use this to obtain an approximate
confidence interval. For example, when h = 2, an approximate 95% confidence interval (for large n) is

$$\hat f_{n,2} \pm 1.96\,\hat\sigma(1 + \hat\rho^2)^{1/2}.$$ [18.56]

Because we are underestimating the standard deviation of $y_{n+h}$, this interval is too narrow, but perhaps
not by much, especially if n is large.
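Equations (18.55) and (18.56) translate into two small functions. This is a sketch in which the function names and all numerical inputs are hypothetical; the estimates $\hat\alpha$, $\hat\rho$, and $\hat\sigma$ are taken as given rather than computed from data, and the interval ignores estimation error in the point forecast, as in the text.

```python
import math

def ar1_forecast(alpha_hat, rho_hat, y_n, h):
    """h-step-ahead AR(1) forecast, equation (18.55)."""
    return sum(rho_hat**j for j in range(h)) * alpha_hat + rho_hat**h * y_n

def ar1_forecast_sd(sigma_hat, rho_hat, h):
    """Estimated standard deviation of the h-step forecast error:
    sigma_hat * [rho^(2(h-1)) + ... + rho^2 + 1]^(1/2)."""
    return sigma_hat * math.sqrt(sum(rho_hat**(2 * j) for j in range(h)))

# Illustrative (made-up) estimates and last observation.
alpha_hat, rho_hat, sigma_hat, y_n = 1.0, 0.5, 1.0, 2.0

f2 = ar1_forecast(alpha_hat, rho_hat, y_n, h=2)
sd2 = ar1_forecast_sd(sigma_hat, rho_hat, h=2)
print(f"two-step forecast: {f2:.3f}")
print(f"approximate 95% interval (18.56): [{f2 - 1.96 * sd2:.3f}, {f2 + 1.96 * sd2:.3f}]")

# The error standard deviation rises with the horizon toward sigma / sqrt(1 - rho^2).
for h in (1, 2, 5, 20):
    print(h, round(ar1_forecast_sd(sigma_hat, rho_hat, h), 4))
```

Setting `rho_hat` to one reproduces the random walk case, where the forecast is $\hat\alpha h + y_n$ and the error standard deviation is $\hat\sigma\sqrt{h}$.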

A less traditional, but useful, approach is to estimate a different model for each forecast horizon.
For example, suppose we wish to forecast y two periods ahead. If $I_t$ depends only on y through time
t, we might assume that $E(y_{t+2}|I_t) = \alpha_0 + \gamma_1 y_t$ [which, as we saw earlier, holds if $\{y_t\}$ follows an
AR(1) model]. We can estimate $\alpha_0$ and $\gamma_1$ by regressing $y_t$ on an intercept and on $y_{t-2}$. Even though
the errors in this equation contain serial correlation (errors in adjacent periods are correlated), we
can obtain consistent and approximately normal estimators of $\alpha_0$ and $\gamma_1$. The forecast of $y_{n+2}$ at time
n is simply $\hat f_{n,2} = \hat\alpha_0 + \hat\gamma_1 y_n$. Further, and very importantly, the standard error of the regression is just
what we need for computing a confidence interval for the forecast. Unfortunately, to get the standard
error of $\hat f_{n,2}$ using the trick for a one-step-ahead forecast requires us to obtain a serial
correlation-robust standard error of the kind described in Section 12-5. This standard error goes to zero as n gets
large while the variance of the error is constant. Therefore, we can get an approximate interval by
using (18.56) and by putting the SER from the regression of $y_t$ on $y_{t-2}$ in place of $\hat\sigma(1 + \hat\rho^2)^{1/2}$. But
we should remember that this ignores the estimation error in $\hat\alpha_0$ and $\hat\gamma_1$.

We can also compute multiple-step-ahead forecasts with more complicated autoregressive 
models. For example, suppose {y,} follows an AR(2) model and that at time n, we wish to forecast 
Yn+2- NOW, Yn+2 = & + PiYnt1 + Pn + Un+2, SO 


E(Yn+ alln) =ar PiE(¥n +12) + Pn: 


We can write this as 


Ía2 5 A F Pifai + Pon 


so that the two-step-ahead forecast at time n can be obtained once we get the one-step-ahead forecast. 
If the parameters of the AR(2) model have been estimated by OLS, then we operationalize this as 


$\hat{f}_{n,2} = \hat{\alpha} + \hat{\rho}_1 \hat{f}_{n,1} + \hat{\rho}_2 y_n.$ [18.57]


Now, $\hat{f}_{n,1} = \hat{\alpha} + \hat{\rho}_1 y_n + \hat{\rho}_2 y_{n-1}$, which we can compute at time n. Then, we plug this into (18.57),
along with $y_n$, to obtain $\hat{f}_{n,2}$. For any h > 2, the h-step-ahead forecast for an AR(2) model is
easy to obtain recursively: $\hat{f}_{n,h} = \hat{\alpha} + \hat{\rho}_1 \hat{f}_{n,h-1} + \hat{\rho}_2 \hat{f}_{n,h-2}$.
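The AR(2) recursion can be written as a short loop. The sketch below is illustrative only: the function name `ar2_forecasts` and the coefficient values are hypothetical, and the coefficients are taken as given rather than estimated.

```python
def ar2_forecasts(alpha, rho1, rho2, y_n, y_nm1, horizon):
    """Return [f_{n,1}, ..., f_{n,horizon}] for an AR(2) with the given coefficients."""
    prev2, prev1 = y_nm1, y_n          # start from the last two observed values
    forecasts = []
    for _ in range(horizon):
        f = alpha + rho1 * prev1 + rho2 * prev2
        forecasts.append(f)
        prev2, prev1 = prev1, f        # the newest forecast replaces the oldest input
    return forecasts

# Hypothetical coefficient estimates and last two observations
forecasts = ar2_forecasts(alpha=1.0, rho1=0.5, rho2=0.2, y_n=3.0, y_nm1=2.0, horizon=3)
```

The first step uses only observed data; from the third step on, both inputs are themselves forecasts, which is why forecast error variances grow with the horizon.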
Similar reasoning can be used to obtain multiple-step-ahead forecasts for VAR models. To
illustrate, suppose we have


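$y_t = \delta_0 + \alpha_1 y_{t-1} + \gamma_1 z_{t-1} + u_t$ [18.58]

and

$z_t = \eta_0 + \beta_1 y_{t-1} + \rho_1 z_{t-1} + v_t.$

Now, if we wish to forecast $y_{n+1}$ at time n, we simply use $\hat{f}_{n,1} = \hat{\delta}_0 + \hat{\alpha}_1 y_n + \hat{\gamma}_1 z_n$. Likewise, the
forecast of $z_{n+1}$ at time n is (say) $\hat{g}_{n,1} = \hat{\eta}_0 + \hat{\beta}_1 y_n + \hat{\rho}_1 z_n$. Now, suppose we wish to obtain a
two-step-ahead forecast of y at time n. From (18.58), we have


$E(y_{n+2}|I_n) = \delta_0 + \alpha_1 E(y_{n+1}|I_n) + \gamma_1 E(z_{n+1}|I_n)$


[because $E(u_{n+2}|I_n) = 0$], so we can write the forecast as


$\hat{f}_{n,2} = \hat{\delta}_0 + \hat{\alpha}_1 \hat{f}_{n,1} + \hat{\gamma}_1 \hat{g}_{n,1}.$ [18.59]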


This equation shows that the two-step-ahead forecast for y depends on the one-step-ahead 
forecasts for y and z. Generally, we can build up multiple-step-ahead forecasts of y by using the 
recursive formula 


$\hat{f}_{n,h} = \hat{\delta}_0 + \hat{\alpha}_1 \hat{f}_{n,h-1} + \hat{\gamma}_1 \hat{g}_{n,h-1}, \quad h \geq 2.$
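A minimal sketch of this recursion for a two-equation VAR with one lag follows. The function name `var1_forecasts`, the tuple layout of the coefficients, and all numerical values are assumptions made for illustration; the forecasts of z are advanced alongside those of y because each step for y needs the previous step for z.

```python
def var1_forecasts(y_eq, z_eq, y_n, z_n, horizon):
    """Forecast paths for y and z.

    y_eq = (delta0, alpha1, gamma1) for the y equation,
    z_eq = (eta0, beta1, rho1) for the z equation.
    Returns a list of (f_{n,h}, g_{n,h}) pairs for h = 1, ..., horizon.
    """
    f, g = y_n, z_n
    path = []
    for _ in range(horizon):
        # Both right-hand sides use the previous step's values of f and g
        f, g = (y_eq[0] + y_eq[1] * f + y_eq[2] * g,
                z_eq[0] + z_eq[1] * f + z_eq[2] * g)
        path.append((f, g))
    return path

# Hypothetical coefficients; beta1 = 0 makes z follow an AR(1)
path = var1_forecasts((1.0, 0.5, 0.1), (0.5, 0.0, 0.6), y_n=2.0, z_n=1.0, horizon=2)
```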


Example: Two-Year-Ahead Forecast for the Unemployment Rate


To use equation (18.49) to forecast unemployment two years out (say, the 2012 rate using the
data through 2010), we need a model for inflation. The best model for inf in terms of lagged
unem and inf appears to be a simple AR(1) model ($unem_{-1}$ is not close to being statistically
significant when added to the regression). Estimating an AR(1) model using the data through 2010
gives

$\widehat{inf}_t = 1.149 + .666\, inf_{t-1}$
(0.446) (.094)
$n = 62,\ R^2 = .448.$


If we plug $inf_{2010} = 1.6$ into the right-hand side of this equation, we get the forecast of inf for 2011:
$\widehat{inf}_{2011} = 1.149 + .666(1.6) \approx 2.21$.


Next, we can plug $\widehat{inf}_{2011}$, along with $\widehat{unem}_{2011} = 8.24$ [which we obtained earlier using equation
(18.49)], into (18.49) to forecast $unem_{2012}$:


$\widehat{unem}_{2012} = 1.085 + .719(8.24) + .158(2.21) \approx 7.36.$


The one-step-ahead forecast of $unem_{2012}$, obtained by plugging the 2011 values of unem and inf into
(18.49), is about 7.99. The actual unemployment rate in 2012 was 8.1%, which means that the
one-step-ahead forecast is, in this case, better than the two-step-ahead forecast. We expect this to be
usually true because the one-step-ahead forecast is based on more (recent) information.


Just as with one-step-ahead forecasting, an out-of-sample root mean squared error or a mean 
absolute error can be used to choose among multiple-step-ahead forecasting methods. 
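For concreteness, the two criteria can be computed as below. The function names and the numbers are hypothetical and serve only to show the calculation on a handful of out-of-sample forecast errors; NumPy is assumed.

```python
import numpy as np

def rmse(actual, forecast):
    """Root mean squared error of a set of forecast errors."""
    e = np.asarray(actual, dtype=float) - np.asarray(forecast, dtype=float)
    return float(np.sqrt(np.mean(e ** 2)))

def mae(actual, forecast):
    """Mean absolute error of a set of forecast errors."""
    e = np.asarray(actual, dtype=float) - np.asarray(forecast, dtype=float)
    return float(np.mean(np.abs(e)))

# Hypothetical out-of-sample outcomes and forecasts from one method
actual = [8.0, 7.5, 7.0]
method_a = [7.0, 7.5, 8.0]
rmse_a, mae_a = rmse(actual, method_a), mae(actual, method_a)
```

Computing both measures for each candidate method over the same hold-out periods gives a like-for-like comparison; the method with the smaller value would be preferred under that criterion.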
18-5e Forecasting Trending, Seasonal, and Integrated Processes


We now turn to forecasting series that either exhibit trends, have seasonality, or have unit roots. Recall 
from Chapters 10 and 11 that one approach to handling trending dependent or independent variables 
in regression models is to include time trends, the most popular being a linear trend. Trends can be 
included in forecasting equations as well, although they must be used with caution. 

In the simplest case, suppose that $\{y_t\}$ has a linear trend but is unpredictable around that trend.
Then, we can write


$y_t = \alpha + \beta t + u_t, \quad E(u_t|I_{t-1}) = 0, \quad t = 1, 2, \ldots,$ [18.60]


where, as usual, $I_{t-1}$ contains information observed through time t - 1 (which includes
at least past y). How do we forecast $y_{n+h}$ at time n for any $h \geq 1$? This is simple because
$E(y_{n+h}|I_n) = \alpha + \beta(n + h)$. The forecast error variance is simply $\sigma^2 = Var(u_t)$ (assuming a
constant variance over time). If we estimate $\alpha$ and $\beta$ by OLS using the first n observations, then our
forecast for $y_{n+h}$ at time n is $\hat{f}_{n,h} = \hat{\alpha} + \hat{\beta}(n + h)$. In other words, we simply plug the time period
corresponding to y into the estimated trend function. For example, if we use the n = 131 observations
in BARIUM to forecast monthly imports of barium chloride to the United States from
China, we obtain $\hat{\alpha} = 249.56$ and $\hat{\beta} = 5.15$. The sample period ends in December 1988, so the
forecast of imports of Chinese barium chloride six months later is 249.56 + 5.15(137) = 955.11,
measured in short tons. For comparison, the December 1988 value is 1,087.81, so it is greater than
the forecasted value six months later. The series and its estimated trend line are shown in Figure 18.2.

As we discussed in Chapter 10, most economic time series are better characterized as having, at
least approximately, a constant growth rate, which suggests that $\log(y_t)$ follows a linear time trend.
Suppose we use n observations to obtain the equation


$\widehat{\log(y_t)} = \hat{\alpha} + \hat{\beta} t, \quad t = 1, 2, \ldots, n.$ [18.61]


FIGURE 18.2 U.S. imports of Chinese barium chloride (in short tons) and its estimated linear trend line, 249.56 + 5.15 t.

Suppose you model $\{y_t: t = 1, 2, \ldots, 46\}$ as a linear time trend, where data are annual
starting in 1950 and ending in 1995. Define [...] compare with $\hat{\alpha}$ and $\hat{\beta}$ in $\hat{y}_t = \hat{\alpha} + \hat{\beta} t$?
How will forecasts from the two equations compare?

Then, to forecast log(y) at any future time period n + h, we just plug n + h into the trend equation, as
before. But this does not allow us to forecast y, which is usually what we want. It is tempting to simply
exponentiate $\hat{\alpha} + \hat{\beta}(n + h)$ to obtain the forecast for $y_{n+h}$, but this is not quite right, for the same
reasons we gave in Section 6-4. We must properly account for the error implicit in (18.61). The simplest
way to do this is to use the n observations to regress $y_t$ on $\exp(\widehat{logy}_t)$ without an intercept. Let $\hat{\gamma}$ be
the slope coefficient on $\exp(\widehat{logy}_t)$. Then, the forecast of y in period n + h is simply


$\hat{f}_{n,h} = \hat{\gamma} \exp[\hat{\alpha} + \hat{\beta}(n + h)].$ [18.62]


As an example, if we use the first 687 weeks of data on the New York Stock Exchange index in
NYSE, we obtain $\hat{\alpha} = 3.782$ and $\hat{\beta} = .0019$ [by regressing $\log(price_t)$ on a linear time trend]; this
shows that the index grows about .2% per week, on average. When we regress price on the
exponentiated fitted values, we obtain $\hat{\gamma} = 1.018$. Now, we forecast price four weeks out, which is the
last week in the sample, using (18.62): $1.018 \cdot \exp[3.782 + .0019(691)] \approx 166.12$. The actual value
turned out to be 164.25, so we have somewhat overpredicted. But this result is much better than if
we estimate a linear time trend for the first 687 weeks: the forecasted value for week 691 is 152.23,
which is a substantial underprediction.
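The two-stage calculation behind (18.62) can be sketched as follows. This is a hedged illustration, not a reproduction of the NYSE results: the function name `log_trend_forecast` and the generated series are hypothetical, and NumPy is assumed.

```python
import numpy as np

def log_trend_forecast(y, h):
    """Forecast y_{n+h} from a linear trend in log(y), rescaled as in (18.62)."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    t = np.arange(1, n + 1)
    X = np.column_stack([np.ones(n), t])
    # Step 1: OLS of log(y_t) on an intercept and t
    (a_hat, b_hat), *_ = np.linalg.lstsq(X, np.log(y), rcond=None)
    m = np.exp(a_hat + b_hat * t)          # exponentiated fitted values
    # Step 2: regress y_t on m_t without an intercept; the slope is gamma-hat
    gamma_hat = (m @ y) / (m @ m)
    return gamma_hat * np.exp(a_hat + b_hat * (n + h))

# Hypothetical series with roughly 1% growth per period plus noise
rng = np.random.default_rng(0)
periods = np.arange(1, 201)
series = np.exp(1.0 + 0.01 * periods + rng.normal(scale=0.05, size=200))
forecast_4_ahead = log_trend_forecast(series, h=4)
```

When the log-trend fits exactly, the rescaling factor equals one and the forecast reduces to the exponentiated trend; otherwise the factor adjusts for the error term that simple exponentiation ignores.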

Although trend models can be useful for prediction, they must be used with caution, especially
for forecasting integrated series with drift far into the future. The potential problem can be seen
by considering a random walk with drift. At time t + h, we can write $y_{t+h}$ as


$y_{t+h} = \beta h + y_t + u_{t+1} + \cdots + u_{t+h},$


where $\beta$ is the drift term (usually $\beta > 0$), and each $u_{t+j}$ has zero mean given $I_t$ and constant variance
$\sigma^2$. As we saw earlier, the forecast of $y_{t+h}$ at time t is $E(y_{t+h}|I_t) = \beta h + y_t$, and the forecast error
variance is $\sigma^2 h$. What happens if we use a linear trend model? Let $y_0$ be the initial value of the process at
time zero, which we take as nonrandom. Then, we can also write


$y_{t+h} = y_0 + \beta(t + h) + u_1 + u_2 + \cdots + u_{t+h}$
$\quad\;\; = y_0 + \beta(t + h) + v_{t+h}.$


This looks like a linear trend model with the intercept $\alpha = y_0$. But the error, $v_{t+h}$, while having mean
zero, has variance $\sigma^2(t + h)$. Therefore, if we use the linear trend $y_0 + \beta(t + h)$ to forecast $y_{t+h}$ at
time t, the forecast error variance is $\sigma^2(t + h)$, compared with $\sigma^2 h$ when we use $\beta h + y_t$. The ratio of
the forecast variances is (t + h)/h, which can be big for large t. The bottom line is that we should not
use a linear trend to forecast a random walk with drift. (Computer Exercise C8 asks you to compare
forecasts from a cubic trend line and those from the simple random walk model for the general
fertility rate in the United States.)
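A small simulation can make the variance ratio concrete. The sketch below is an assumption-laden illustration: the drift, error standard deviation, starting value, and sample sizes are all hypothetical, the drift is treated as known (as in the derivation above), and NumPy is assumed.

```python
import numpy as np

rng = np.random.default_rng(1)
beta, sigma, y0 = 0.5, 1.0, 0.0      # hypothetical drift, error s.d., and start value
t, h, reps = 40, 4, 20000            # forecast origin, horizon, number of replications

# Each row is one simulated random walk with drift: y_1, ..., y_{t+h}
u = rng.normal(0.0, sigma, size=(reps, t + h))
y = y0 + beta * np.arange(1, t + h + 1) + np.cumsum(u, axis=1)

y_t, y_th = y[:, t - 1], y[:, t + h - 1]
err_rw = y_th - (beta * h + y_t)             # forecast built on the last observed level
err_trend = y_th - (y0 + beta * (t + h))     # forecast read off the trend line
ratio = err_trend.var() / err_rw.var()       # theory: (t + h) / h = 11 here
```

The trend-line forecast discards the information in $y_t$, so its error accumulates all t + h shocks rather than only the last h.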

Deterministic trends can also produce poor forecasts if the trend parameters are estimated using 
old data and the process has a subsequent shift in the trend line. Sometimes, exogenous shocks—such 
as the oil crises of the 1970s—can change the trajectory of trending variables. If an old trend line is 
used to forecast far into the future, the forecasts can be way off. This problem can be mitigated by 
using the most recent data available to obtain the trend line parameters. 

Nothing prevents us from combining trends with other models for forecasting. For example, we 
can add a linear trend to an AR(1) model, which can work well for forecasting series with linear 
trends but which are also stable AR processes around the trend. 

It is also straightforward to forecast processes with deterministic seasonality (monthly or
quarterly series). For example, the file BARIUM contains the monthly production of gasoline in the
United States from 1978 through 1988. This series has no obvious trend, but it does have a strong
seasonal pattern. (Gasoline production is higher in the summer months and in December.) In the simplest
model, we would regress gas (measured in gallons) on 11 month dummies, say, for February through 
December. Then, the forecast for any future month is simply the intercept plus the coefficient on the 
appropriate month dummy. (For January, the forecast is just the intercept in the regression.) We can 
also add lags of variables and time trends to allow for general series with seasonality. 
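The month-dummy forecast can be sketched as below. The data are made up (they are not the BARIUM series), the function name `seasonal_dummy_forecast` is hypothetical, and NumPy is assumed. With only an intercept and month dummies, the forecast for a month is simply the sample average for that month.

```python
import numpy as np

def seasonal_dummy_forecast(y, months, target_month):
    """OLS of y on an intercept and February..December dummies (January is the base)."""
    y = np.asarray(y, dtype=float)
    months = np.asarray(months)
    dummies = (months[:, None] == np.arange(2, 13)).astype(float)  # columns: Feb..Dec
    X = np.column_stack([np.ones(len(y)), dummies])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    # January: intercept only; other months: intercept plus that month's coefficient
    return coef[0] if target_month == 1 else coef[0] + coef[target_month - 1]

# Hypothetical three years of monthly data with a summer peak
months = np.tile(np.arange(1, 13), 3)
gas = 100.0 + 10.0 * np.isin(months, [6, 7, 8]) + 0.1 * np.arange(36)
july_forecast = seasonal_dummy_forecast(gas, months, target_month=7)
```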

Forecasting processes with unit roots also deserves special attention. Earlier, we obtained the
expected value of a random walk conditional on information through time n. To forecast a random
walk, with possible drift $\alpha$, h periods into the future at time n, we use $\hat{f}_{n,h} = \hat{\alpha}h + y_n$, where $\hat{\alpha}$ is
the sample average of the $\Delta y_t$ up through t = n. (If there is no drift, we set $\hat{\alpha} = 0$.) This approach
imposes the unit root. An alternative would be to estimate an AR(1) model for $\{y_t\}$ and to use the
forecast formula (18.55). This approach does not impose a unit root, but if one is present, $\hat{\rho}$ converges
in probability to one as n gets large. Nevertheless, $\hat{\rho}$ can be substantially different from one, especially
if the sample size is not very large. The matter of which approach produces better out-of-sample
forecasts is an empirical issue. If in the AR(1) model, $\rho$ is less than one, even slightly, the AR(1) model
will tend to produce better long-run forecasts.

Generally, there are two approaches to producing forecasts for I(1) processes. The first is to 
impose a unit root. For a one-step-ahead forecast, we obtain a model to forecast the change in y, $\Delta y_{t+1}$,
given information through time t. Then, because $y_{t+1} = \Delta y_{t+1} + y_t$, $E(y_{t+1}|I_t) = E(\Delta y_{t+1}|I_t) + y_t$.
Therefore, our forecast of $y_{n+1}$ at time n is just


$\hat{f}_n = \hat{g}_n + y_n,$

where $\hat{g}_n$ is the forecast of $\Delta y_{n+1}$ at time n. Typically, an AR model (which is necessarily stable) is
used for $\Delta y_t$, or a vector autoregression.

This can be extended to multiple-step-ahead forecasts by writing $y_{n+h}$ as


$y_{n+h} = (y_{n+h} - y_{n+h-1}) + (y_{n+h-1} - y_{n+h-2}) + \cdots + (y_{n+1} - y_n) + y_n$


or

$y_{n+h} = \Delta y_{n+h} + \Delta y_{n+h-1} + \cdots + \Delta y_{n+1} + y_n.$

Therefore, the forecast of $y_{n+h}$ at time n is

$\hat{f}_{n,h} = \hat{g}_{n,h} + \hat{g}_{n,h-1} + \cdots + \hat{g}_{n,1} + y_n,$ [18.63]


where $\hat{g}_{n,j}$ is the forecast of $\Delta y_{n+j}$ at time n. For example, we might model $\Delta y_t$ as a stable AR(1),
obtain the multiple-step-ahead forecasts from (18.55) (but with $\hat{\alpha}$ and $\hat{\rho}$ obtained from $\Delta y_t$ on $\Delta y_{t-1}$
and $y_n$ replaced with $\Delta y_n$), and then plug these into (18.63).
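A minimal sketch of this first approach follows: fit an AR(1) to the first differences, iterate the AR(1) forecasts of the changes, and cumulate them onto the last observed level as in (18.63). The function name `i1_forecasts` and the simulated series are hypothetical, and NumPy is assumed.

```python
import numpy as np

def i1_forecasts(y, horizon):
    """Forecast levels y_{n+1}, ..., y_{n+horizon} from an AR(1) fitted to first differences."""
    y = np.asarray(y, dtype=float)
    dy = np.diff(y)
    # OLS of dy_t on an intercept and dy_{t-1}
    X = np.column_stack([np.ones(len(dy) - 1), dy[:-1]])
    (a_hat, r_hat), *_ = np.linalg.lstsq(X, dy[1:], rcond=None)
    g, level, out = dy[-1], y[-1], []
    for _ in range(horizon):
        g = a_hat + r_hat * g      # forecast of the next change
        level = level + g          # running sum of forecasted changes plus y_n
        out.append(level)
    return out

# Hypothetical I(1) series: a random walk with drift 0.2
rng = np.random.default_rng(2)
walk = np.cumsum(0.2 + rng.normal(size=300))
level_forecasts = i1_forecasts(walk, horizon=4)
```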

The second approach to forecasting I(1) variables is to use a general AR or VAR model for $\{y_t\}$.
This does not impose the unit root. For example, if we use an AR(2) model,


$y_t = \alpha + \rho_1 y_{t-1} + \rho_2 y_{t-2} + u_t,$ [18.64]


then a unit root corresponds to $\rho_1 + \rho_2 = 1$. If we plug in $\rho_1 = 1 - \rho_2$ and rearrange, we obtain
$\Delta y_t = \alpha - \rho_2 \Delta y_{t-1} + u_t$,
which is a stable AR(1) model in the difference that takes us back to the first approach described earlier.
Nothing prevents us from estimating (18.64) directly by OLS. One nice thing about this regression is
that we can use the usual t statistic on $\hat{\rho}_2$ to determine if $y_{t-2}$ is significant. (This assumes that the
homoskedasticity assumption holds; if not, we can use the heteroskedasticity-robust form.) We will not show
this formally, but, intuitively, it follows by rewriting the equation as $y_t = \alpha + \gamma y_{t-1} - \rho_2 \Delta y_{t-1} + u_t$,
where $\gamma = \rho_1 + \rho_2$. Even if $\gamma = 1$, $\rho_2$ is minus the coefficient on a stationary, weakly dependent
process $\{\Delta y_{t-1}\}$. Because the regression results will be identical to (18.64), we can use (18.64) directly.

As an example, let us estimate an AR(2) model for the general fertility rate in FERTIL3, using
the observations through 1979. (In Computer Exercise C8, you are asked to use this model for fore- 
casting, which is why we save some observations at the end of the sample.) 


$\widehat{gfr}_t = 3.22 + 1.272\, gfr_{t-1} - .311\, gfr_{t-2}$
(2.92) (.120) (.121) [18.65]
$n = 65,\ R^2 = .949,\ \bar{R}^2 = .947.$


The t statistic on the second lag is about -2.57, which is statistically different from zero at about the
1% level. (The first lag also has a very significant t statistic, which has an approximate t distribution
by the same reasoning used for $\hat{\rho}_2$.) The R-squared, adjusted or not, is not especially informative as a
goodness-of-fit measure because gfr apparently contains a unit root, and it makes little sense to ask
how much of the variance in gfr we are explaining.

The coefficients on the two lags in (18.65) add up to .961, which is close to and not statistically
different from one (as can be verified by applying the augmented Dickey-Fuller test to the equation
$\Delta gfr_t = \alpha + \theta gfr_{t-1} + \delta_1 \Delta gfr_{t-1} + u_t$). Even though we have not imposed the unit root restriction,
we can still use (18.65) for forecasting, as we discussed earlier.

Before ending this section, we point out one potential improvement in forecasting in the
context of vector autoregressive models with I(1) variables. Suppose $\{y_t\}$ and $\{z_t\}$ are each I(1)
processes. One approach for obtaining forecasts of y is to estimate a bivariate autoregression in the
variables $\Delta y_t$ and $\Delta z_t$ and then to use (18.63) to generate one- or multiple-step-ahead forecasts;
this is essentially the first approach we described earlier. However, if $y_t$ and $z_t$ are cointegrated,
we have more stationary, stable variables in the information set that can be used in forecasting
$\Delta y$: namely, lags of $y_t - \beta z_t$, where $\beta$ is the cointegrating parameter. A simple error correction
model is


$\Delta y_t = \alpha_0 + \alpha_1 \Delta y_{t-1} + \gamma_1 \Delta z_{t-1} + \delta(y_{t-1} - \beta z_{t-1}) + e_t,$ [18.66]
$E(e_t|I_{t-1}) = 0.$

To forecast $y_{n+1}$, we use observations up through n to estimate the cointegrating parameter, $\beta$, and
then estimate the parameters of the error correction model by OLS, as described in Section 18-4.
Forecasting $\Delta y_{n+1}$ is easy: we just plug $\Delta y_n$, $\Delta z_n$, and $y_n - \hat{\beta} z_n$ into the estimated equation. Having
obtained the forecast of $\Delta y_{n+1}$, we add it to $y_n$.
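The plug-in step for the error correction model can be written out directly. In this sketch the function name `ecm_one_step_forecast` and every numerical value are hypothetical; the parameter estimates are taken as given rather than produced by the two-step procedure.

```python
def ecm_one_step_forecast(params, beta_hat, y_n, y_nm1, z_n, z_nm1):
    """Forecast y_{n+1} from estimates of (18.66).

    params = (alpha0, alpha1, gamma1, delta); beta_hat is the estimated
    cointegrating parameter.
    """
    alpha0, alpha1, gamma1, delta = params
    dy_hat = (alpha0
              + alpha1 * (y_n - y_nm1)          # lagged change in y
              + gamma1 * (z_n - z_nm1)          # lagged change in z
              + delta * (y_n - beta_hat * z_n)) # error correction term
    return y_n + dy_hat                         # add the forecasted change to the level

# Hypothetical estimates and most recent observations
y_next = ecm_one_step_forecast((0.1, 0.3, 0.2, -0.5), beta_hat=1.0,
                               y_n=5.0, y_nm1=4.5, z_n=4.0, z_nm1=4.2)
```

With a negative value of delta, a level of y above its long-run relationship with z pulls the forecasted change downward, which is the error correction at work.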
By rearranging the error correction model, we can write


$y_t = \alpha_0 + \rho_1 y_{t-1} + \rho_2 y_{t-2} + \delta_1 z_{t-1} + \delta_2 z_{t-2} + u_t,$ [18.67]


where $\rho_1 = 1 + \alpha_1 + \delta$, $\rho_2 = -\alpha_1$, and so on, which is the first equation in a VAR model for $y_t$
and $z_t$. Notice that this depends on five parameters, just as many as in the error correction model. The
point is that, for the purposes of forecasting, the VAR model in the levels and the error correction model
are essentially the same. This is not the case in more general error correction models. For example,
suppose that $\alpha_1 = \gamma_1 = 0$ in (18.66), but we have a second error correction term, $\delta_2(y_{t-2} - \beta z_{t-2})$.
Then, the error correction model involves only four parameters, whereas (18.67), which has the same
order of lags for y and z, contains five parameters. Thus, error correction models can economize on
parameters; that is, they are generally more parsimonious than VARs in levels.

If $y_t$ and $z_t$ are I(1) but not cointegrated, the appropriate model is (18.66) without the error
correction term. This can be used to forecast $\Delta y_{n+1}$, and we can add this to $y_n$ to forecast $y_{n+1}$.
Summary


The time series topics covered in this chapter are used routinely in empirical macroeconomics, empirical 
finance, and a variety of other applied fields. We began by showing how infinite distributed lag models 
can be interpreted and estimated. These can provide flexible lag distributions with fewer parameters than a 
similar finite distributed lag model. The geometric distributed lag and, more generally, rational distributed 
lag models are the most popular. They can be estimated using standard econometric procedures on simple 
dynamic equations. 

Testing for a unit root has become very common in time series econometrics. If a series has a unit root, 
then, in many cases, the usual large sample normal approximations are no longer valid. In addition, a unit 
root process has the property that an innovation has a long-lasting effect, which is of interest in its own 
right. While there are many tests for unit roots, the Dickey-Fuller t test (and its extension, the augmented
Dickey-Fuller test) is probably the most popular and easiest to implement. We can allow for a linear trend
when testing for unit roots by adding a trend to the Dickey-Fuller regression.

When an I(1) series, $y_t$, is regressed on another I(1) series, $x_t$, there is serious concern about spurious
regression, even if the series do not contain obvious trends. This has been studied thoroughly in the case of
a random walk: even if the two random walks are independent, the usual t test for significance of the slope
coefficient, based on the usual critical values, will reject much more than the nominal size of the test. In
addition, the $R^2$ tends to a random variable, rather than to zero (as would be the case if we regress the
difference in $y_t$ on the difference in $x_t$).

In one important case, a regression involving I(1) variables is not spurious, and that is when the series
are cointegrated. This means that a linear function of the two I(1) variables is I(0). If $y_t$ and $x_t$ are I(1) but
$y_t - x_t$ is I(0), $y_t$ and $x_t$ cannot drift arbitrarily far apart. There are simple tests of the null of no
cointegration against the alternative of cointegration, one of which is based on applying a Dickey-Fuller unit root
test to the residuals from a static regression. There are also simple estimators of the cointegrating parameter
that yield t statistics with approximate standard normal distributions (and asymptotically valid confidence
intervals). We covered the leads and lags estimator in Section 18-4.

Cointegration between $y_t$ and $x_t$ implies that error correction terms may appear in a model relating $\Delta y_t$
to $\Delta x_t$; the error correction terms are lags in $y_t - \beta x_t$, where $\beta$ is the cointegrating parameter. A simple
two-step estimation procedure is available for estimating error correction models. First, $\beta$ is estimated using a
static regression (or the leads and lags regression). Then, OLS is used to estimate a simple dynamic model 
in first differences that includes the error correction terms. 

Section 18-5 contained an introduction to forecasting, with emphasis on regression-based forecast- 
ing methods. Static models or, more generally, models that contain explanatory variables dated con- 
temporaneously with the dependent variable, are limited because then the explanatory variables need 
to be forecasted. If we plug in hypothesized values of unknown future explanatory variables, we obtain 
a conditional forecast. Unconditional forecasts are similar to simply modeling $y_t$ as a function of past
information we have observed at the time the forecast is needed. Dynamic regression models, including 
autoregressions and vector autoregressions, are used routinely. In addition to obtaining one-step-ahead 
point forecasts, we also discussed the construction of forecast intervals, which are very similar to predic- 
tion intervals. 

Various criteria are used for choosing among forecasting methods. The most common performance 
measures are the root mean squared error and the mean absolute error. Both estimate the size of the average 
forecast error. It is most informative to compute these measures using out-of-sample forecasts. 

Multiple-step-ahead forecasts present new challenges and are subject to large forecast error variances. 
Nevertheless, for models such as autoregressions and vector autoregressions, multiple-step-ahead forecasts 
can be computed, and approximate forecast intervals can be obtained. 
Forecasting trending and I(1) series requires special care. Processes with deterministic trends can be
forecasted by including time trends in regression models, possibly with lags of variables. A potential draw- 
back is that deterministic trends can provide poor forecasts for long-horizon forecasts: once it is estimated, 
a linear trend continues to increase or decrease. The typical approach to forecasting an I(1) process is to 
forecast the difference in the process and to add the level of the variable to that forecasted difference. Alter- 
natively, vector autoregressive models can be used in the levels of the series. If the series are cointegrated, 
error correction models can be used instead. 


Key Terms 


Augmented Dickey-Fuller Test
Cointegration
Conditional Forecast
Dickey-Fuller Distribution
Dickey-Fuller (DF) Test
Engle-Granger Test
Engle-Granger Two-Step Procedure
Error Correction Model
Exponential Smoothing
Forecast Error
Forecast Interval
Geometric (or Koyck) Distributed Lag
Granger Causality
Infinite Distributed Lag (IDL) Model
Information Set
In-Sample Criteria
Leads and Lags Estimator
Loss Function
Martingale
Martingale Difference Sequence
Mean Absolute Error (MAE)
Multiple-Step-Ahead Forecast
One-Step-Ahead Forecast
Out-of-Sample Criteria
Point Forecast
Rational Distributed Lag (RDL) Model
Root Mean Squared Error (RMSE)
Spurious Regression Problem
Unconditional Forecast
Unit Roots
Vector Autoregressive (VAR) Model


Problems 


1 Consider equation (18.15) with k = 2. Using the IV approach to estimating the $\gamma_h$ and $\rho$, what would
you use as instruments for $y_{t-1}$?


2 An interesting economic model that leads to an econometric model with a lagged dependent variable
relates $y_t$ to the expected value of $x_t$, say, $x_t^*$, where the expectation is based on all observed information
at time t - 1:


$y_t = \alpha_0 + \alpha_1 x_t^* + u_t.$ [18.68]


A natural assumption on $\{u_t\}$ is that $E(u_t|I_{t-1}) = 0$, where $I_{t-1}$ denotes all information on y and x
observed at time t - 1; this means that $E(y_t|I_{t-1}) = \alpha_0 + \alpha_1 x_t^*$. To complete this model, we need an
assumption about how the expectation $x_t^*$ is formed. We saw a simple example of adaptive expectations
in Section 11-2, where $x_t^* = x_{t-1}$. A more complicated adaptive expectations scheme is


$x_t^* - x_{t-1}^* = \lambda(x_{t-1} - x_{t-1}^*),$ [18.69]


where $0 < \lambda < 1$. This equation implies that the change in expectations reacts to whether last period's
realized value was above or below its expectation. The assumption $0 < \lambda < 1$ implies that the change
in expectations is a fraction of last period's error.

(i) Show that the two equations imply that


$y_t = \lambda\alpha_0 + (1 - \lambda)y_{t-1} + \lambda\alpha_1 x_{t-1} + u_t - (1 - \lambda)u_{t-1}.$


[Hint: Lag equation (18.68) one period, multiply it by $(1 - \lambda)$, and subtract this from (18.68).
Then, use (18.69).]

(ii) Under $E(u_t|I_{t-1}) = 0$, $\{u_t\}$ is serially uncorrelated. What does this imply about the new errors,
$v_t = u_t - (1 - \lambda)u_{t-1}$?

(iii) If we write the equation from part (i) as


$y_t = \beta_0 + \beta_1 y_{t-1} + \beta_2 x_{t-1} + v_t,$


how would you consistently estimate the $\beta_j$?

(iv) Given consistent estimators of the $\beta_j$, how would you consistently estimate $\lambda$ and $\alpha_1$?


3 Suppose that $\{y_t\}$ and $\{z_t\}$ are I(1) series, but $y_t - \beta z_t$ is I(0) for some $\beta \neq 0$. Show that for any
$\delta \neq \beta$, $y_t - \delta z_t$ must be I(1).


4 Consider the error correction model in equation (18.37). Show that if you add another lag of the
error correction term, $y_{t-2} - \beta x_{t-2}$, the equation suffers from perfect collinearity. (Hint: Show that
$y_{t-2} - \beta x_{t-2}$ is a perfect linear function of $y_{t-1} - \beta x_{t-1}$, $\Delta x_{t-1}$, and $\Delta y_{t-1}$.)


5 Suppose the process $\{(x_t, y_t): t = 0, 1, 2, \ldots\}$ satisfies the equations

$y_t = \beta x_t + u_t$

and

$\Delta x_t = \gamma \Delta x_{t-1} + v_t,$


where $E(u_t|I_{t-1}) = E(v_t|I_{t-1}) = 0$, $I_{t-1}$ contains information on x and y dated at time t - 1 and earlier,
$\beta \neq 0$, and $|\gamma| < 1$ [so that $x_t$, and therefore $y_t$, is I(1)]. Show that these two equations imply an error
correction model of the form


$\Delta y_t = \gamma_1 \Delta x_{t-1} + \delta(y_{t-1} - \beta x_{t-1}) + e_t,$


where $\gamma_1 = \beta\gamma$, $\delta = -1$, and $e_t = u_t + \beta v_t$. (Hint: First subtract $y_{t-1}$ from both sides of the first
equation. Then, add and subtract $\beta x_{t-1}$ from the right-hand side and rearrange. Finally, use the second
equation to get the error correction model that contains $\Delta x_{t-1}$.)


6 Using the monthly data in VOLAT, the following model was estimated:

$\widehat{pcip} = 1.54 + .344\, pcip_{-1} + .074\, pcip_{-2} + .073\, pcip_{-3} + .031\, pcsp_{-1}$
(.56) (.042) (.045) (.042) (.013)
$n = 554,\ R^2 = .174,\ \bar{R}^2 = .168,$


where pcip is the percentage change in monthly industrial production, at an annualized rate, and pcsp
is the percentage change in the Standard & Poor's 500 Index, also at an annualized rate.

(i) If the past three months of pcip are zero and $pcsp_{-1} = 0$, what is the predicted growth in
industrial production for this month? Is it statistically different from zero?
(ii) If the past three months of pcip are zero but $pcsp_{-1} = 10$, what is the predicted growth in
industrial production?
(iii) What do you conclude about the effects of the stock market on real economic activity?


7 Let $gM_t$ be the annual growth in the money supply and let $unem_t$ be the unemployment rate. Assuming
that $unem_t$ follows a stable AR(1) process, explain in detail how you would test whether gM Granger
causes unem.


8 Suppose that $y_t$ follows the model


$y_t = \alpha + \delta_1 z_{t-1} + u_t,$
$u_t = \rho u_{t-1} + e_t,$
$E(e_t|I_{t-1}) = 0,$


where $I_{t-1}$ contains y and z dated at t - 1 and earlier.

(i) Show that $E(y_{t+1}|I_t) = (1 - \rho)\alpha + \rho y_t + \delta_1 z_t - \rho\delta_1 z_{t-1}$. (Hint: Write
$u_{t-1} = y_{t-1} - \alpha - \delta_1 z_{t-2}$ and plug this into the second equation; then, plug the result into the
first equation and take the conditional expectation.)

(ii) Suppose that you use n observations to estimate $\alpha$, $\delta_1$, and $\rho$. Write the equation for forecasting
$y_{n+1}$.

(iii) Explain why the model with one lag of z and AR(1) serial correlation is a special case of the model


$y_t = \alpha_0 + \rho y_{t-1} + \gamma_1 z_{t-1} + \gamma_2 z_{t-2} + e_t.$


(iv) What does part (iii) suggest about using models with AR(1) serial correlation for forecasting?


9 Let $\{y_t\}$ be an I(1) sequence. Suppose that $\hat{g}_n$ is the one-step-ahead forecast of $\Delta y_{n+1}$ and let
$\hat{f}_n = \hat{g}_n + y_n$ be the one-step-ahead forecast of $y_{n+1}$. Explain why the forecast errors for forecasting
$\Delta y_{n+1}$ and $y_{n+1}$ are identical.


10 Consider the geometric distributed lag model in equation (18.8), written in estimating equation form as in
equation (18.11):


$y_t = \alpha_0 + \gamma z_t + \rho y_{t-1} + v_t,$


where $v_t = u_t - \rho u_{t-1}$.

(i) Suppose that you are only willing to assume the sequential exogeneity assumption in (18.6).
Why is $z_t$ generally correlated with $v_t$?

(ii) Explain why estimating (18.11) by IV, using instruments $(z_t, z_{t-1})$, is generally inconsistent
under (18.6). Using the IV estimator, can you test whether $z_t$ and $v_t$ are correlated?

(iii) Evaluate the following proposal when only (18.6) holds: Estimate (18.11) by IV using
instruments $(z_{t-1}, z_{t-2})$.

(iv) Explain what you gain by estimating (18.11) by 2SLS using instruments $(z_t, z_{t-1}, z_{t-2})$.

(v) In equation (18.16), the estimating equation for a rational distributed lag model, how would you
estimate the parameters under (18.6) only? Might there be some practical problems with your
approach?


Computer Exercises 


C1 Use the data in WAGEPRC for this exercise. Problem 5 in Chapter 11 gave estimates of a finite
distributed lag model of gprice on gwage, where 12 lags of gwage are used.

(i) Estimate a simple geometric DL model of gprice on gwage. In particular, estimate equation
(18.11) by OLS. What are the estimated impact propensity and LRP? Sketch the estimated lag
distribution.

(ii) Compare the estimated IP and LRP to those obtained in Problem 5 in Chapter 11. How do the
estimated lag distributions compare?

(iii) Now, estimate the rational distributed lag model from (18.16). Sketch the lag distribution and
compare the estimated IP and LRP to those obtained in part (ii).


C2 Use the data in HSEINV for this exercise.

(i) Test for a unit root in log(invpc), including a linear time trend and two lags of $\Delta\log(invpc_t)$.
Use a 5% significance level.

(ii) Use the approach from part (i) to test for a unit root in log(price).

(iii) Given the outcomes in parts (i) and (ii), does it make sense to test for cointegration between
log(invpc) and log(price)?


C3 Use the data in VOLAT for this exercise.

(i) Estimate an AR(3) model for pcip. Now, add a fourth lag and verify that it is very insignificant.

(ii) To the AR(3) model from part (i), add three lags of pcsp to test whether pcsp Granger causes
pcip. Carefully state your conclusion.

(iii) To the model in part (ii), add three lags of the change in i3, the three-month T-bill rate. Does
pcsp Granger cause pcip conditional on past $\Delta i3$?


C4 In testing for cointegration between gfr and pe in Example 18.5, add $t^2$ to equation (18.32) to obtain the
OLS residuals. Include one lag in the augmented DF test. The 5% critical value for the test is -4.15.


C5 Use INTQRT for this exercise. 
(i) In Example 18.7, we estimated an error correction model for the holding yield on six-month
T-bills, where one lag of the holding yield on three-month T-bills is the explanatory variable. We
assumed that the cointegration parameter was one in the equation $hy6_t = \alpha + \beta hy3_{t-1} + u_t$.
Now, add the lead change, $\Delta hy3_t$, the contemporaneous change, $\Delta hy3_{t-1}$, and the lagged
change, $\Delta hy3_{t-2}$, of $hy3_{t-1}$. That is, estimate the equation


$hy6_t = \alpha + \beta hy3_{t-1} + \phi_0 \Delta hy3_t + \phi_1 \Delta hy3_{t-1} + \rho_1 \Delta hy3_{t-2} + e_t$


and report the results in equation form. Test $H_0$: $\beta = 1$ against a two-sided alternative. Assume
that the lead and lag are sufficient so that $\{hy3_{t-1}\}$ is strictly exogenous in this equation and do
not worry about serial correlation.

(ii) To the error correction model in (18.39), add $\Delta hy3_{t-2}$ and $(hy6_{t-2} - hy3_{t-3})$. Are these terms
jointly significant? What do you conclude about the appropriate error correction model?


C6 Use the data in PHILLIPS for this exercise. 

(i) Estimate the models represented in equations (18.48) and (18.49) using the data through 2015. 
Do the parameter estimates change much compared with (18.48) and (18.49)? 

(ii) Use the new equations to forecast $unem_{2016}$; round to two decimal places. Which equation
produces a better forecast?

(iii) Use the equation that includes $inf_{t-1}$, estimated in part (i), to forecast $unem_{2017}$. You will need to
obtain the 2016 values for unem and inf. Next, reestimate the parameters using the data through
2016, and use the updated estimates to forecast $unem_{2017}$. Does using the extra year of data to
obtain the parameter estimates produce a better forecast?

(iv) Use the equation that includes $inf_{t-1}$, estimated in part (i), to obtain a two-step-ahead forecast
of unem. You will need the forecast of $unem_{2016}$ from part (ii). Also, use an AR(1) model for
inflation, using the data through 2015, to forecast 2016 inflation.


C7 Use the data in BARIUM for this exercise. 

(i) Estimate the linear trend model $chnimp_t = \alpha + \beta t + u_t$ using the first 119 observations (this
excludes the last 12 months of observations for 1988). What is the standard error of the regression?

(ii) Now, estimate an AR(1) model for chnimp, again using all data but the last 12 months. Compare
the standard error of the regression with that from part (i). Which model provides a better
in-sample fit?

(iii) Use the models from parts (i) and (ii) to compute the one-step-ahead forecast errors for the
12 months in 1988. (You should obtain 12 forecast errors for each method.) Compute and
compare the RMSEs and the MAEs for the two methods. Which forecasting method works
better out-of-sample for one-step-ahead forecasts?

(iv) Add monthly dummy variables to the regression from part (i). Are these jointly significant? 
(Do not worry about the slight serial correlation in the errors from this regression when doing 
the joint test.) 


C8 Use the data in FERTIL3 for this exercise. 
(i) Graph gfr against time. Does it contain a clear upward or downward trend over the entire 
sample period? 
(ii) Using the data through 1979, estimate a cubic time trend model for gfr (that is, regress gfr on $t$,
$t^2$, and $t^3$, along with an intercept). Comment on the R-squared of the regression.
(iii) Using the model in part (ii), compute the mean absolute error of the one-step-ahead forecast 
errors for the years 1980 through 1984. 
(iv) Using the data through 1979, regress $\Delta gfr_t$ on a constant only. Is the constant statistically
different from zero? Does it make sense to assume that any drift term is zero, if we assume that
$gfr_t$ follows a random walk?
(v) Now, forecast gfr for 1980 through 1984, using a random walk model: the forecast of $gfr_{n+1}$ is
simply $gfr_n$. Find the MAE. How does it compare with the MAE from part (iii)? Which method
of forecasting do you prefer?
(vi) Now, estimate an AR(2) model for gfr, again using the data only through 1979. Is the second lag
significant?
(vii) Obtain the MAE for 1980 through 1984, using the AR(2) model. Does this more general model
work better out-of-sample than the random walk model?


C9 Use CONSUMP for this exercise.

(i) Let $y_t$ be real per capita disposable income. Use the data through 1989 to estimate the model


$$y_t = \alpha + \beta t + \rho y_{t-1} + u_t$$


and report the results in the usual form.

(ii) Use the estimated equation from part (i) to forecast y in 1990. What is the forecast error?

(iii) Compute the mean absolute error of the one-step-ahead forecasts for the 1990s, using the
parameters estimated in part (i).

(iv) Now, compute the MAE over the same period, but drop $y_{t-1}$ from the equation. Is it better to
include $y_{t-1}$ in the model or not?


C10 Use the data in INTQRT for this exercise.

(i) Using the data from all but the last four years (16 quarters), estimate an AR(1) model for $\Delta r6_t$.
(We use the difference because it appears that $r6_t$ has a unit root.) Find the RMSE of the
one-step-ahead forecasts for $\Delta r6_t$ using the last 16 quarters.

(ii) Now, add the error correction term $spr_{t-1} = r6_{t-1} - r3_{t-1}$ to the equation from part (i). (This
assumes that the cointegrating parameter is one.) Compute the RMSE for the last 16 quarters.
Does the error correction term help with out-of-sample forecasting in this case?

(iii) Now, estimate the cointegrating parameter, rather than setting it to one. Use the last 16 quarters
again to produce the out-of-sample RMSE. How does this compare with the forecasts from parts
(i) and (ii)?

(iv) Would your conclusions change if you wanted to predict r6 rather than $\Delta r6$? Explain.


C11 Use the data in VOLAT for this exercise.

(i) Confirm that lsp500 = log(sp500) and lip = log(ip) appear to contain unit roots. Use
Dickey-Fuller tests with four lagged changes and do the tests with and without a linear time trend.

(ii) Run a simple regression of lsp500 on lip. Comment on the sizes of the t statistic and R-squared.

(iii) Use the residuals from part (ii) to test whether lsp500 and lip are cointegrated. Use the standard
Dickey-Fuller test and the ADF test with two lags. What do you conclude?

(iv) Add a linear time trend to the regression from part (ii) and now test for cointegration using the
same tests from part (iii).

(v) Does it appear that stock prices and real economic activity have a long-run equilibrium
relationship?


C12 This exercise also uses the data from VOLAT. Computer Exercise C11 studies the long-run
relationship between stock prices and industrial production. Here, you will study the question of Granger
causality using the percentage changes.

(i) Estimate an AR(3) model for $pcip_t$, the percentage change in industrial production (reported at
an annualized rate). Show that the second and third lags are jointly significant at the 2.5% level.

(ii) Add one lag of $pcsp_t$ to the equation estimated in part (i). Is the lag statistically significant?
What does this tell you about Granger causality between the growth in industrial production and
the growth in stock prices?


(iii) Redo part (ii) but obtain a heteroskedasticity-robust t statistic. Does the robust test change your
conclusions from part (ii)?


C13 Use the data in TRAFFIC2 for this exercise. These monthly data, on traffic accidents in California over
the years 1981 to 1989, were used in Computer Exercise C11 in Chapter 10.

(i) Using the standard Dickey-Fuller regression, test whether $ltotacc_t$ has a unit root. Can you reject
a unit root at the 2.5% level?

(ii) Now, add two lagged changes to the test from part (i) and compute the augmented
Dickey-Fuller test. What do you conclude?

(iii) Add a linear time trend to the ADF regression from part (ii). Now what happens?

(iv) Given the findings from parts (i) through (iii), what would you say is the best characterization
of $ltotacc_t$: an I(1) process or an I(0) process about a linear time trend?

(v) Test the percentage of fatalities, $prcfat_t$, for a unit root, using two lags in an ADF regression. In
this case, does it matter whether you include a linear time trend?


C14 Use the data in MINWAGE.DTA for sector 232 to answer the following questions.

(i) Confirm that $lwage232_t$ and $lemp232_t$ are best characterized as I(1) processes. Use the
augmented DF test with one lag of gwage232 and gemp232, respectively, and a linear time
trend. Is there any doubt that these series should be assumed to have unit roots?

(ii) Regress $lemp232_t$ on $lwage232_t$ and test for cointegration, both with and without a time trend,
allowing for two lags in the augmented Engle-Granger test. What do you conclude?

(iii) Now regress $lemp232_t$ on log of the real wage rate, $lrwage232_t = lwage232_t - lcpi_t$, and a time
trend. Do you find cointegration? Are they “closer” to being cointegrated when you use real
wages rather than nominal wages?

(iv) What are some factors that might be missing from the cointegrating regression in part (iii)?


C15 This question asks you to study the so-called Beveridge Curve from the perspective of cointegration
analysis. The U.S. monthly data from December 2000 through February 2012 are in BEVERIDGE.

(i) Test for a unit root in urate using the usual Dickey-Fuller test (with a constant) and the
augmented DF with two lags of curate. What do you conclude? Are the lags of curate in the
augmented DF test statistically significant? Does it matter to the outcome of the unit root test?

(ii) Repeat part (i) but with the vacancy rate, vrate.

(iii) Assuming that urate and vrate are both I(1), the Beveridge Curve,


$$urate_t = \alpha + \beta vrate_t + u_t$$


only makes sense if urate and vrate are cointegrated (with cointegrating parameter $\beta < 0$). Test
for cointegration using the Engle-Granger test with no lags. Are urate and vrate cointegrated at
the 10% significance level? What about at the 5% level?

(iv) Obtain the leads and lags estimator with $cvrate_t$, $cvrate_{t-1}$, and $cvrate_{t+1}$ as the I(0) explanatory
variables added to the equation in part (iii). Obtain the Newey-West standard error for $\hat{\beta}$ using
four lags (so g = 4 in the notation of Section 12-5). What is the resulting 95% confidence
interval for $\beta$? How does it compare with the confidence interval that is not robust to serial
correlation (or heteroskedasticity)?

(v) Redo the Engle-Granger test but with two lags in the augmented DF regression. What happens? What
do you conclude about the robustness of the claim that urate and vrate are cointegrated?


C16 Use the data in PHILLIPS for this exercise.

(i) Using all of the years, through 2017, run the regression $\Delta inf_t$ on $inf_{t-1}$ (and an intercept)
and test the null hypothesis that $\{inf_t\}$ is I(1) against the alternative that it is I(0). At what
significance level do you reject the null hypothesis?

(ii) What is the estimated value of $\rho$ from part (i)? Would you say it is practically different from one?

(iii) Now run the augmented Dickey-Fuller regression by including $\Delta inf_{t-1}$ as a regressor. Do your
conclusions from part (i) change? Does it appear $\Delta inf_{t-1}$ needs to be in the regression?


CHAPTER 19


Carrying Out an Empirical Project


In this chapter, we discuss the ingredients of a successful empirical analysis, with emphasis on
completing a term project. In addition to reminding you of the important issues that have arisen
throughout the text, we emphasize recurring themes that are important for applied research. We also
provide suggestions for topics as a way of stimulating your imagination. Several sources of economic
research and data are given as references.


19-1 Posing a Question 


The importance of posing a very specific question that, in principle, can be answered with data cannot
be overstated. Without being explicit about the goal of your analysis, you cannot know where to
begin. The widespread availability of rich data sets makes it tempting to launch into data collection
based on half-baked ideas, but this is often counterproductive. It is likely that, without carefully
formulating your hypotheses and the kind of model you will need to estimate, you will forget to collect
information on important variables, obtain a sample from the wrong population, or collect data for the 
wrong time period. 

This does not mean that you should pose your question in a vacuum. Especially for a one-term 
project you cannot be too ambitious. Therefore, when choosing a topic, you should be reasonably sure 
that data sources exist that will allow you to answer your question in the allotted time. 

You need to decide what areas of economics or other social sciences interest you when selecting 
a topic. For example, if you have taken a course in labor economics, you have probably seen theories 
that can be tested empirically or relationships that have some policy relevance. Labor economists 
are constantly coming up with new variables that can explain wage differentials. Examples include
quality of high school [Card and Krueger (1992) and Betts (1995)], amount of math and science taken
in high school [Levine and Zimmerman (1995)], and physical appearance [Hamermesh and Biddle 
(1994), Averett and Korenman (1996), Biddle and Hamermesh (1998), and Hamermesh and Parker 
(2005)]. Researchers in state and local public finance study how local economic activity depends on 
economic policy variables, such as property taxes, sales taxes, level and quality of services (such as 
schools, fire, and police), and so on. [See, for example, White (1986), Papke (1987), Bartik (1991), 
Netzer (1992), and Mark, McGuire, and Papke (2000).]

Economists who study education issues are interested in determining how spending affects
performance [Hanushek (1986)], whether attending certain kinds of schools improves performance [for
example, Evans and Schwab (1995)], and what factors affect where private schools choose to locate 
[Downes and Greenstein (1996)]. 

Macroeconomists are interested in relationships between various aggregate time series, such as 
the link between growth in gross domestic product and growth in fixed investment or machinery [see 
De Long and Summers (1991)] or the effect of taxes on interest rates [for example, Peek (1982)]. 

There are certainly reasons for estimating models that are mostly descriptive. For example,
property tax assessors use models (called hedonic price models) to estimate housing values for homes that
have not been sold recently. This involves a regression model relating the price of a house to its
characteristics (size, number of bedrooms, number of bathrooms, and so on). As a topic for a term paper,
this is not very exciting: we are unlikely to learn much that is surprising, and such an analysis has no
obvious policy implications. Adding the crime rate in the neighborhood as an explanatory variable
would allow us to determine how strongly crime affects housing prices, something that would
be useful in estimating the costs of crime.

Several relationships have been estimated using macroeconomic data that are mostly descriptive. 
For example, an aggregate saving function can be used to estimate the aggregate marginal propensity 
to save, as well as the response of saving to asset returns (such as interest rates). Such an analysis 
could be made more interesting by using time series data on a country that has a history of political 
upheavals and determining whether savings rates decline during times of political uncertainty. 

Once you decide on an area of research, there are a variety of ways to locate specific papers on 
the topic. The Journal of Economic Literature (JEL) has a detailed classification system in which 
each paper is given a set of identifying codes that places it within certain subfields of economics. The 
JEL also contains a list of articles published in a wide variety of journals, organized by topic, and it 
even contains short abstracts of some articles. 

Especially convenient for finding published papers on various topics are Internet services, such 
as EconLit, to which many universities subscribe. EconLit allows users to do a comprehensive search 
of almost all economics journals by author, subject, words in the title, and so on. The Social Sciences 
Citation Index is useful for finding papers on a broad range of topics in the social sciences, including 
popular papers that have been cited often in other published works. 

Google Scholar is an Internet search engine that can be very helpful for tracking down research 
on various topics or research by a particular author. This is especially true of work that has not been 
published in an academic journal or that has not yet been published. 

In thinking about a topic, you should keep some things in mind. First, for a question to be
interesting, it does not need to have broad-based policy implications; rather, it can be of local interest. For
example, you might be interested in knowing whether living in a fraternity at your university causes 
students to have lower or higher grade point averages. This may or may not be of interest to people 
outside your university, but it is probably of concern to at least some people within the university. On 
the other hand, you might study a problem that starts by being of local interest but turns out to have 
widespread interest, such as determining which factors affect, and which university policies can stem, 
alcohol abuse on college campuses. 

Second, it is very difficult, especially for a quarter or semester project, to do truly original 
research using the standard macroeconomic aggregates on the U.S. economy. For example, the
question of whether money growth, government spending growth, and so on affect economic growth has
been and continues to be studied by professional macroeconomists. The question of whether stock or
other asset returns can be systematically predicted using known information has, for obvious reasons, 
been studied pretty carefully. This does not mean that you should avoid estimating macroeconomic or 
empirical finance models, as even just using more recent data can add constructively to a debate. In 
addition, you can sometimes find a new variable that has an important effect on economic aggregates 
or financial returns; such a discovery can be exciting. 

The point is that exercises such as using a few additional years to estimate a standard Phillips 
curve or an aggregate consumption function for the U.S. economy, or some other large economy, are 
unlikely to yield additional insights, although they can be instructive for the student. Instead, you 
might use data on a smaller country to estimate a static or dynamic Phillips curve or a Beveridge 
curve (possibly allowing the slopes of the curves to depend on information known prior to the current 
time period), or to test the efficient markets hypothesis, and so on. 

At the nonmacroeconomic level, there are also plenty of questions that have been studied 
extensively. For example, labor economists have published many papers on estimating the return to 
education. This question is still studied because it is very important, and new data sets, as well as 
new econometric approaches, continue to be developed. For example, as we saw in Chapter 9, certain 
data sets have better proxy variables for unobserved ability than other data sets. (Compare WAGE1
and WAGE2.) In other cases, we can obtain panel data or data from a natural experiment—see 
Chapter 13—that allow us to approach an old question from a different perspective. 

As another example, criminologists are interested in studying the effects of various laws on crime. 
The question of whether capital punishment has a deterrent effect has long been debated. Similarly, 
economists have been interested in whether taxes on cigarettes and alcohol reduce consumption (as 
always, in a ceteris paribus sense). As more years of data at the state level become available, a richer 
panel data set can be created, and this can help us better answer major policy questions. Plus, the 
effectiveness of fairly recent crime-fighting innovations, such as community policing, can be
evaluated empirically.

While you are formulating your question, it is helpful to discuss your ideas with your classmates, 
instructor, and friends. You should be able to convince people that the answer to your question is 
of some interest. (Of course, whether you can persuasively answer your question is another issue, 
but you need to begin with an interesting question.) If someone asks you about your paper and you 
respond with “I’m doing my paper on crime” or “I’m doing my paper on interest rates,” chances are 
you have only decided on a general area without formulating a true question. You should be able to 
say something like “I’m studying the effects of community policing on city crime rates in the United 
States” or “I’m looking at how inflation volatility affects short-term interest rates in Brazil.” 


19-2 Literature Review 


All papers, even if they are relatively short, should contain a review of relevant literature. It is rare 
that one attempts an empirical project for which no published precedent exists. If you search through 
journals or use online search services such as EconLit to come up with a topic, you are already well 
on your way to a literature review. If you select a topic on your own—such as studying the effects of 
drug usage on college performance at your university—then you will probably have to work a little 
harder. But online search services make that work a lot easier, as you can search by keywords, by 
words in the title, by author, and so on. You can then read abstracts of papers to see how relevant they 
are to your own work. 

When doing your literature search, you should think of related topics that might not show up in 
a search using a handful of keywords. For example, if you are studying the effects of drug usage on 
wages or grade point average, you should probably look at the literature on how alcohol usage affects 
such factors. Knowing how to do a thorough literature search is an acquired skill, but you can get a 
long way by thinking before searching. 

Researchers differ on how a literature review should be incorporated into a paper. Some like to
have a separate section called “literature review,” while others like to include the literature review as
part of the introduction. This is largely a matter of taste, although an extensive literature review
probably deserves its own section. If the term paper is the focus of the course (say, in a senior seminar
or an advanced econometrics course), your literature review probably will be lengthy. Term papers at
the end of a first course are typically shorter, and the literature reviews are briefer.


19-3 Data Collection 


19-3a Deciding on the Appropriate Data Set 


Collecting data for a term paper can be educational, exciting, and sometimes even frustrating. You 
must first decide on the kind of data needed to answer your posed question. As we discussed in the 
introduction and have covered throughout this text, data sets come in a variety of forms. The most 
common kinds are cross-sectional, time series, pooled cross sections, and panel data sets. 

Many questions can be addressed using any of the data structures we have described. For example,
to study whether more law enforcement lowers crime, we could use a cross section of cities, a
time series for a given city, or a panel data set of cities, which consists of data on the same cities over
two or more years.

Deciding on which kind of data to collect often depends on the nature of the analysis. To answer
questions at the individual or family level, we often only have access to a single cross section;
typically, these are obtained via surveys. Then, we must ask whether we can obtain a rich enough data set
to do a convincing ceteris paribus analysis. For example, suppose we want to know whether families
who save through individual retirement accounts (IRAs), which have certain tax advantages, have
less non-IRA savings. In other words, does IRA saving simply crowd out other forms of saving?
There are data sets, such as the Survey of Consumer Finances, that contain information on various
kinds of saving for a different sample of families each year. Several issues arise in using such a data
set. Perhaps the most important is whether there are enough controls (including income, demographics,
and proxies for saving tastes) to do a reasonable ceteris paribus analysis. If these are the only
kinds of data available, we must do what we can with them.

The same issues arise with cross-sectional data on firms, cities, states, and so on. In most cases, 
it is not obvious that we will be able to do a ceteris paribus analysis with a single cross section. For 
example, any study of the effects of law enforcement on crime must recognize the endogeneity of law 
enforcement expenditures. When using standard regression methods, it may be very hard to complete 
a convincing ceteris paribus analysis, no matter how many controls we have. (See Section 19-4 for 
more discussion.) 

If you have read the advanced chapters on panel data methods, you know that having the same 
cross-sectional units at two or more different points in time can allow us to control for time-constant 
unobserved effects that would normally confound regression on a single cross section. Panel data sets 
are relatively hard to obtain for individuals or families—although some important ones exist, such as 
the Panel Study of Income Dynamics—but they can be used in very convincing ways. Panel data sets 
on firms also exist. For example, Compustat and the Center for Research in Security Prices (CRSP) 
manage very large panel data sets of financial information on firms. Easier to obtain are panel data 
sets on larger units, such as schools, cities, counties, and states, as these tend not to disappear over 
time, and government agencies are responsible for collecting information on the same variables each 
year. For example, the Federal Bureau of Investigation collects and reports detailed information on 
crime rates at the city level. Sources of data are listed at the end of this chapter. 

Data come in a variety of forms. Some data sets, especially historical ones, are available 
only in printed form. For small data sets, entering the data yourself from the printed source is 
manageable and convenient. Sometimes, articles are published with small data sets, especially time
series applications. These can be used in an empirical study, perhaps by supplementing the data with
more recent years.

Many data sets are available in electronic form. Various government agencies provide data on 
their websites. Private companies sometimes compile data sets to make them user friendly, and then 
they provide them for a fee. Authors of papers are often willing to provide their data sets in electronic 
form. More and more data sets are available on the Internet. The web is a vast resource of online 
databases. Numerous websites containing economic and related data sets have been created. Several 
other websites contain links to data sets that are of interest to economists; some of these are listed at 
the end of this chapter. Generally, searching the Internet for data sources is easy and will become even 
more convenient in the future. 


19-3b Entering and Storing Your Data 


After you have decided on a data type and have located a data source, you must put the data into a 
usable format. If the data came in electronic form, they are already in some format, hopefully one in 
widespread use. The most flexible way to obtain data in electronic form is as a standard text (ASCII) 
file. All statistics and econometrics software packages allow raw data to be stored this way. Typically, 
it is straightforward to read a text file directly into an econometrics package, provided the file is 
properly structured. The data files we have used throughout the text provide several examples of how 
cross-sectional, time series, pooled cross sections, and panel data sets are usually stored. As a rule, 
the data should have a tabular form, with each observation representing a different row; the columns 
in the data set represent different variables. Occasionally, you might encounter a data set stored with 
each column representing an observation and each row a different variable. This is not ideal, but most 
software packages allow data to be read in this form and then reshaped. Naturally, it is crucial to know 
how the data are organized before reading them into your econometrics package. 
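
As a rough illustration of the reshaping step just described, the following Python sketch takes a small text file in which each row is a variable and each column an observation, and transposes it into the usual tabular form. The variable names, the values, and the helper name `transpose_to_tabular` are all invented for illustration; your own econometrics package will have its own commands for this.

```python
import csv
import io

# Hypothetical raw text: each ROW is a variable, each COLUMN an observation.
RAW = """wage,3.10,3.24,3.00
educ,11,12,11
exper,2,22,2
"""

def transpose_to_tabular(text):
    """Return (names, rows), where rows holds one list per observation."""
    records = list(csv.reader(io.StringIO(text)))
    names = [rec[0] for rec in records]  # first entry on each line is the variable name
    columns = [[float(v) for v in rec[1:]] for rec in records]
    if len({len(col) for col in columns}) != 1:
        raise ValueError("variables have different numbers of observations")
    rows = [list(obs) for obs in zip(*columns)]  # transpose to one row per observation
    return names, rows

names, rows = transpose_to_tabular(RAW)
print(names)
for row in rows:
    print(row)
```

The check on column lengths is there because a transposed file with a missing entry would otherwise silently shift values across observations.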

For time series data sets, there is only one sensible way to enter and store the data: namely, 
chronologically, with the earliest time period listed as the first observation and the most recent time 
period as the last observation. It is often useful to include variables indicating year and, if relevant, 
quarter or month. This facilitates estimation of a variety of models later on, including allowing for 
seasonality and breaks at different time periods. For cross sections pooled over time, it is usually best 
to have the cross section for the earliest year fill the first block of observations, followed by the cross 
section for the second year, and so on. (See FERTIL! as an example.) This arrangement is not crucial, 
but it is very important to have a variable stating the year attached to each observation. 

For panel data, as we discussed in Section 13-5, it is best if all the years for each cross-sectional 
observation are adjacent and in chronological order. With this ordering, we can use all of the panel 
data methods from Chapters 13 and 14. With panel data, it is important to include a unique identifier 
for each cross-sectional unit, along with a year variable. 
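
To make the ordering concrete, here is a minimal Python sketch, assuming a made-up panel stored as (unit identifier, year, value) records. It checks that each identifier and year pair appears only once, sorts by unit and then year, and computes within-unit changes, which is only meaningful once adjacent rows belong to the same unit. The function names and the numbers are hypothetical.

```python
# Hypothetical long-format panel: (unit id, year, value), deliberately out of order.
panel = [
    (2, 1990, 7.1),
    (1, 1987, 5.0),
    (2, 1987, 6.3),
    (1, 1990, 5.8),
]

def sort_panel(records):
    """Sort by unit id, then year, after checking (id, year) pairs are unique."""
    keys = [(unit, year) for unit, year, _ in records]
    if len(keys) != len(set(keys)):
        raise ValueError("duplicate (id, year) pair found")
    return sorted(records, key=lambda rec: (rec[0], rec[1]))

def first_differences(sorted_records):
    """Difference each value with the previous row of the SAME unit."""
    diffs = []
    for prev, curr in zip(sorted_records, sorted_records[1:]):
        if prev[0] == curr[0]:  # never difference across different units
            diffs.append((curr[0], curr[1], curr[2] - prev[2]))
    return diffs

ordered = sort_panel(panel)
changes = first_differences(ordered)
for rec in ordered:
    print(rec)
print(changes)
```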

If you obtain your data in printed form, you have several options for entering them into a computer. 
First, you can create a text file using a standard text editor. (This is how several of the raw data sets 
included with the text were initially created.) Typically, it is required that each row starts a new
observation, that each row contains the same ordering of the variables (in particular, each row should
have the same number of entries), and that the values are separated by at least one space. Sometimes,
a different separator, such as a comma, is better, but this depends on the software you are using. 
If you have missing observations on some variables, you must decide how to denote that; simply 
leaving a blank does not generally work. Many regression packages accept a period as the missing 
value symbol. Some people prefer to use a number—presumably an impossible value for the variable 
of interest—to denote missing values. If you are not careful, this can be dangerous; we discuss this 
further later. 
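
The conventions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the file contents are invented, a period is taken as the missing value symbol, and `parse_rows` is a hypothetical helper rather than a feature of any particular package.

```python
import math

# Hypothetical raw text file contents: one observation per row, values
# separated by spaces, and a period marking a missing value.
RAW = """\
1987 12 5.50
1988 . 6.10
1989 16 .
"""

def parse_rows(text, n_vars, missing="."):
    """Parse space-separated rows, mapping the missing symbol to NaN."""
    rows = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != n_vars:
            raise ValueError(
                f"line {line_no}: expected {n_vars} entries, got {len(fields)}"
            )
        rows.append([math.nan if f == missing else float(f) for f in fields])
    return rows

rows = parse_rows(RAW, n_vars=3)
for row in rows:
    print(row)
```

Requiring the same number of entries on every row is what catches a value left blank: the row comes up one entry short instead of being read with its values shifted into the wrong variables.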

If you have nonnumerical data—for example, you want to include the names in a sample of 
colleges or the names of cities—then you should check the econometrics package you will use to see 
the best way to enter such variables (often called strings). Typically, strings are put between double or
single quotation marks. Or the text file can follow a rigid formatting, which usually requires a small 
program to read in the text file. But you need to check your econometrics package for details. 

Another generally available option is to use a spreadsheet to enter your data, such as Excel. This 
has a couple of advantages over a text file. First, because each observation on each variable is a cell, it 
is less likely that numbers will be run together (as would happen if you forget to enter a space in a text 
file). Second, spreadsheets allow manipulation of data, such as sorting or computing averages. This
benefit is less important if you use a software package that allows sophisticated data management; many
software packages, including EViews and Stata, fall into this category. If you use a spreadsheet for 
initial data entry, then you must often export the data in a form that can be read by your econometrics 
package. This is usually straightforward, as spreadsheets export to text files using a variety of formats. 

A third alternative is to enter the data directly into your econometrics package. Although this 
obviates the need for a text editor or a spreadsheet, it can be more awkward if you cannot freely move 
across different observations to make corrections or additions. 

Data downloaded from the Internet may come in a variety of forms. Often data come as text files, 
but different conventions are used for separating variables; for panel data sets, the conventions on how 
to order the data may differ. Some Internet data sets come as spreadsheet files, in which case you must 
use an appropriate spreadsheet to read them. 


19-3c Inspecting, Cleaning, and Summarizing Your Data 


It is extremely important to become familiar with any data set you will use in an empirical analysis. 
If you enter the data yourself, you will be forced to know everything about it. But if you obtain data 
from an outside source, you should still spend some time understanding its structure and conventions. 
Even data sets that are widely used and heavily documented can contain glitches. If you are using a 
data set obtained from the author of a paper, you must be aware that rules used for data set
construction can be forgotten.

Earlier, we reviewed the standard ways that various data sets are stored. You also need to know 
how missing values are coded. Preferably, missing values are indicated with a nonnumeric character, 
such as a period. If a number is used as a missing value code, such as “999” or “-1,” you must be
very careful when using these observations in computing any statistics. Your econometrics package 
will probably not know that a certain number really represents a missing value: it is likely that such 
observations will be used as if they are valid, and this can produce rather misleading results. The best 
approach is to set any numerical codes for missing values to some other character (such as a period) 
that cannot be mistaken for real data. 
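
As a minimal sketch of this recoding step, the following Python fragment replaces numeric missing-value codes with a nonnumeric marker and compares the resulting means. The variable names, the data values, and the particular codes (999 and -1) are hypothetical; the actual codes must come from the documentation of your data set.

```python
from statistics import mean

# Hypothetical education column in which 999 and -1 were used as missing-value codes.
educ_raw = [12, 16, 999, 14, -1, 12]
MISSING_CODES = {999, -1}  # assumed codes; check the data documentation

# Recode the numeric codes to None so they cannot be mistaken for real data.
educ = [None if x in MISSING_CODES else x for x in educ_raw]

naive_mean = mean(educ_raw)                           # treats the codes as valid values
clean_mean = mean(x for x in educ if x is not None)   # uses only the valid entries

print(f"mean with codes left in: {naive_mean:.2f}")
print(f"mean after recoding:     {clean_mean:.2f}")
```

With the codes left in, the average is dominated by the value 999, which is exactly the kind of misleading result described above.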

You must also know the nature of the variables in the data set. Which are binary variables? Which 
are ordinal variables (such as a credit rating)? What are the units of measurement of the variables? For 
example, are monetary values expressed in dollars, thousands of dollars, millions of dollars, or some 
other unit? Are variables representing a rate—such as school dropout rates, inflation rates, unioniza- 
tion rates, or interest rates—measured as a percentage or a proportion? 

Especially for time series data, it is crucial to know if monetary values are in nominal (current) or 
real (constant) dollars. If the values are in real terms, what is the base year or period? 

If you receive a data set from an author, some variables may already be transformed in certain 
ways. For example, sometimes only the log of a variable (such as wage or salary) is reported in the 
data set. Of course, this is fine if you plan to use the variable in logarithm form, but you may want to 
recreate the original variable for computing summary statistics. 

Detecting mistakes in a data set is necessary for preserving the integrity of any data analysis. 
It is always useful to find minimums, maximums, means, and standard deviations of all, or at least 
the most important, variables in the analysis. For example, if you find that the minimum value of 
education in your sample is -99, you know that at least one entry on education needs to be set to a
missing value. If, upon further inspection, you find that several observations have -99 as the level of
education, you can be confident that you have stumbled onto the missing value code for education.
As another example, if you find that an average murder conviction rate across a sample of cities 
is .632, you know that conviction rate is measured as a proportion, not a percentage. Then, if the 
maximum value is above one, this is likely a typographical error. (It is not uncommon to find data 
sets where most of the entries on a rate variable were entered as a percentage, but where some were 
entered as a proportion, and vice versa. Such data coding errors can be difficult to detect, but it is 
important to try.) 
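
The checks just described can be sketched in a few lines of Python. The data below are made up for illustration: the education column contains an undocumented code of -99, and one conviction rate was keyed in as a percentage rather than a proportion. The function name summarize is our own and is not part of any package.

```python
from statistics import mean, stdev

def summarize(values):
    """Return min, max, mean, and sample standard deviation of a list of numbers."""
    return {"min": min(values), "max": max(values),
            "mean": mean(values), "sd": stdev(values)}

# Hypothetical data: -99 is an undocumented missing code; 63.2 was entered as a percentage.
data = {
    "educ": [12, 16, -99, 14, 12, -99],
    "convrate": [0.58, 0.71, 63.2, 0.60, 0.64, 0.66],
}

for name, values in data.items():
    s = summarize(values)
    print(f"{name:10s} min={s['min']:8.2f} max={s['max']:8.2f} "
          f"mean={s['mean']:8.2f} sd={s['sd']:8.2f}")

# Simple screens: schooling cannot be negative, and a proportion cannot exceed one.
suspect_educ = [x for x in data["educ"] if x < 0]
suspect_rate = [x for x in data["convrate"] if x > 1]
print(suspect_educ, suspect_rate)
```

Screens like these only flag candidates; deciding whether a flagged value is a missing code or a typographical error still requires looking at the data documentation.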

We must also be careful in using time series data. If we are using monthly or quarterly data, we 
must know which variables, if any, have been seasonally adjusted. Transforming data also requires 
great care. Suppose we have a monthly data set and we want to create the change in a variable from 
one month to the next. To do this, we must be sure that the data are ordered chronologically, from 
earliest period to latest. If for some reason this is not the case, the differencing will result in garbage. 
To be sure the data are properly ordered, it is useful to have a time period indicator. With annual data, 
it is sufficient to know the year, but then we should know whether the year is entered as four digits 
or two digits (for example, 1998 versus 98). With monthly or quarterly data, it is also useful to have 
a variable or variables indicating month or quarter. With monthly data, we may have a set of dummy 
variables (11 or 12) or one variable indicating the month (1 through 12 or a string variable, such as 
jan, feb, and so on). 
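
A short Python sketch shows why the ordering matters when creating month-to-month changes. The observations are hypothetical and are stored as (year, month, value) triples that arrive out of order; the sort uses the time period indicators before any differencing is done.

```python
# Hypothetical monthly observations stored out of order as (year, month, value).
rows = [(1998, 3, 105.0), (1998, 1, 100.0), (1998, 2, 102.0), (1998, 4, 104.0)]

# Sort chronologically using the time period indicators before differencing.
rows_sorted = sorted(rows, key=lambda r: (r[0], r[1]))

# Month-to-month change; the first period has no predecessor, so its change is missing.
changes = [None] + [cur[2] - prev[2] for prev, cur in zip(rows_sorted, rows_sorted[1:])]

for (year, month, value), change in zip(rows_sorted, changes):
    print(year, month, value, change)
```

Differencing the unsorted list instead would subtract the March value from the January value, which is the "garbage" referred to above.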

With or without yearly, monthly, or quarterly indicators, we can easily construct time trends in all 
econometrics software packages. Creating seasonal dummy variables is easy if the month or quarter is 
indicated; at a minimum, we need to know the month or quarter of the first observation. 
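
As an illustration, a time trend and quarterly dummy variables can be built from nothing more than the quarter of the first observation. The starting quarter and the number of observations below are assumptions made for the example.

```python
first_quarter, n_obs = 3, 6   # hypothetical: the series starts in Q3 and has 6 observations

trend = list(range(1, n_obs + 1))   # t = 1, 2, ..., n_obs

# Quarter of each observation, cycling 3, 4, 1, 2, 3, 4.
quarter = [(first_quarter - 1 + i) % 4 + 1 for i in range(n_obs)]

# Three quarterly dummies, with the first quarter as the base period.
q_dummies = {f"q{q}": [1 if qi == q else 0 for qi in quarter] for q in (2, 3, 4)}

print(trend)
print(quarter)
print(q_dummies)
```

Only three dummies are created because including all four along with an intercept would produce perfect collinearity.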

Manipulating panel data can be even more challenging. In Chapter 13, we discussed pooled OLS 
on the differenced data as one general approach to controlling for unobserved effects. In construct- 
ing the differenced data, we must be careful not to create phantom observations. Suppose we have a 
balanced panel on cities from 1992 through 1997. Even if the data are ordered chronologically within 
each cross-sectional unit—something that should be done before proceeding—a mindless differenc- 
ing will create an observation for 1992 for all cities except the first in the sample. This observation 
will be the 1992 value for city i, minus the 1997 value for city i - 1; this is clearly nonsense. Thus,
we must make sure that 1992 is missing for all differenced variables. 
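
One way to avoid phantom observations is to difference only within a cross-sectional unit. The sketch below uses a hypothetical two-city panel, shortened to three years to keep the listing small, that is already sorted by city and then by year.

```python
# Hypothetical balanced panel, sorted by city and then by year: (city, year, y).
panel = [
    ("A", 1992, 10.0), ("A", 1993, 11.0), ("A", 1994, 13.0),
    ("B", 1992, 20.0), ("B", 1993, 19.0), ("B", 1994, 22.0),
]

diffs = []
prev_city, prev_y = None, None
for city, year, y in panel:
    # Difference only within the same city; the first year of each city is missing.
    dy = y - prev_y if city == prev_city else None
    diffs.append((city, year, dy))
    prev_city, prev_y = city, y

for row in diffs:
    print(row)
```

A mindless difference would have assigned city B in 1992 the value 20.0 - 13.0 = 7.0, mixing two different cities; here that entry is correctly left missing.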


19-4 Econometric Analysis 


This text has focused on econometric analysis, and we are not about to provide a review of econo- 
metric methods in this section. Nevertheless, we can give some general guidelines about the sorts of 
issues that need to be considered in an empirical analysis. 

As we discussed earlier, after deciding on a topic, we must collect an appropriate data set. 
Assuming that this has also been done, we must next decide on the appropriate econometric methods. 

If your course has focused on ordinary least squares estimation of a multiple linear regression 
model, using either cross-sectional or time series data, the econometric approach has pretty much 
been decided for you. This is not necessarily a weakness, as OLS is still the most widely used econo- 
metric method. Of course, you still have to decide whether any of the variants of OLS—such as 
weighted least squares or correcting for serial correlation in a time series regression—are warranted. 

In order to justify OLS, you must also make a convincing case that the key OLS assumptions are 
satisfied for your model. As we have discussed at some length, the first issue is whether the error term 
is uncorrelated with the explanatory variables. Ideally, you have been able to control for enough other 
factors to assume that those that are left in the error are unrelated to the regressors. Especially when 
dealing with individual-, family-, or firm-level cross-sectional data, the self-selection problem— 
which we discussed in Chapters 7 and 15—is often relevant. For instance, in the IRA example 
from Section 19-3, it may be that families with an unobserved taste for saving are also the ones that
open IRAs. You should also be able to argue that the other potential sources of endogeneity, namely,
measurement error and simultaneity, are not a serious problem.

When specifying your model you must also make functional form decisions. Should some vari- 
ables appear in logarithmic form? (In econometric applications, the answer is often yes.) Should some 
variables be included in levels and squares, to possibly capture a diminishing effect? How should 
qualitative factors appear? Is it enough to just include binary variables for different attributes or 
groups? Or do these need to be interacted with quantitative variables? (See Chapter 7 for details.) 

A common mistake, especially among beginners, is to incorrectly include explanatory varia- 
bles in a regression model that are listed as numerical values but have no quantitative meaning. For 
example, in an individual-level data set that contains information on wages, education, experience, 
and other variables, an “occupation” variable might be included. Typically, these are just arbitrary 
codes that have been assigned to different occupations; the fact that an elementary school teacher is 
given, say, the value 453 while a computer technician is, say, 751 is relevant only in that it allows us 
to distinguish between the two occupations. It makes no sense to include the raw occupational vari- 
able in a regression model. (What sense would it make to measure the effect of increasing occupa- 
tion by one unit when the one-unit increase has no quantitative meaning?) Instead, different dummy 
variables should be defined for different occupations (or groups of occupations, if there are many 
occupations). Then, the dummy variables can be included in the regression model. A less egregious 
problem occurs when an ordered qualitative variable is included as an explanatory variable. Suppose 
that in a wage data set a variable is included measuring “job satisfaction,” defined on a scale from 
1 to 7, with 7 being the most satisfied. Provided we have enough data, we would want to define a 
set of six dummy variables for, say, job satisfaction levels of 2 through 7, leaving job satisfaction 
level 1 as the base group. By including the six job satisfaction dummies in the regression, we allow 
a completely flexible relationship between the response variable and job satisfaction. Putting in the 
job satisfaction variable in raw form implicitly assumes that a one-unit increase in the ordinal vari- 
able has quantitative meaning. While the direction of the effect will often be estimated appropriately, 
interpreting the coefficient on an ordinal variable is difficult. If an ordinal variable takes on many 
values, then we can define a set of dummy variables for ranges of values. See Section 17-3 for 
an example. 
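
To make the construction of dummy variables concrete, here is a small Python sketch. The occupation codes are hypothetical, and the choice of the lowest code as the base group is arbitrary; any category could serve as the base.

```python
# Hypothetical occupation codes; the numbers are labels, not quantities.
occ = [453, 751, 453, 120, 751]

categories = sorted(set(occ))   # the distinct codes that appear in the sample
base = categories[0]            # the omitted (base) group; the choice is arbitrary

# One 0/1 indicator per non-base category.
dummies = {f"occ_{c}": [1 if x == c else 0 for x in occ]
           for c in categories if c != base}

print(dummies)
```

The same construction applies to an ordered variable such as job satisfaction: with levels 1 through 7 and level 1 as the base, it would produce the six dummies described above.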

Sometimes, we want to explain a variable that is an ordinal response. For example, one could 
think of using a job satisfaction variable of the type described above as the dependent variable in a 
regression model, with both worker and employer characteristics among the independent variables. 
Unfortunately, with the job satisfaction variable in its original form, the coefficients in the model 
are hard to interpret: each measures the change in job satisfaction given a unit increase in the inde- 
pendent variable. Certain models—ordered probit and ordered logit are the most common—are well 
suited for ordered responses. These models essentially extend the binary probit and logit models we 
discussed in Chapter 17. [See Wooldridge (2010, Chapter 16) for a treatment of ordered response 
models.] A simple solution is to turn any ordered response into a binary response. For example, 
we could define a variable equal to one if job satisfaction is at least four, and zero otherwise. 
Unfortunately, creating a binary variable throws away information and requires us to use a somewhat 
arbitrary cutoff. 

For cross-sectional analysis, a secondary, but nevertheless important, issue is whether there is 
heteroskedasticity. In Chapter 8, we explained how this can be dealt with. The simplest way is to com- 
pute heteroskedasticity-robust statistics. 

As we emphasized in Chapters 10, 11, and 12, time series applications require additional care. 
Should the equation be estimated in levels? If levels are used, are time trends needed? Is differencing 
the data more appropriate? If the data are monthly or quarterly, does seasonality have to be accounted 
for? If you are allowing for dynamics—for example, distributed lag dynamics—how many lags 
should be included? You must start with some lags based on intuition or common sense, but eventually
it is an empirical matter.

If your model has some potential misspecification, such as omitted variables, and you use
OLS, you should attempt some sort of misspecification analysis of the kinds we discussed in 
Chapters 3 and 5. Can you determine, based on reasonable assumptions, the direction of any bias in 
the estimators? 

If you have studied the method of instrumental variables, you know that it can be used to 
solve various forms of endogeneity, including omitted variables (Chapter 15), errors-in-variables 
(Chapter 15), and simultaneity (Chapter 16). Naturally, you need to think hard about whether the 
instrumental variables you are considering are likely to be valid. 

Good papers in the empirical social sciences contain sensitivity analysis. Broadly, this means 
you estimate your original model and modify it in ways that seem reasonable. Hopefully, the impor- 
tant conclusions do not change. For example, if you use as an explanatory variable a measure of 
alcohol consumption (say, in a grade point average equation), do you get qualitatively similar results 
if you replace the quantitative measure with a dummy variable indicating alcohol usage? If the binary 
usage variable is significant but the alcohol quantity variable is not, it could be that usage reflects 
some unobserved attribute that affects GPA and is also correlated with alcohol usage. But this needs 
to be considered on a case-by-case basis. 

If some observations are much different from the bulk of the sample—say, you have a few firms 
in a sample that are much larger than the other firms—do your results change much when those 
observations are excluded from the estimation? If so, you may have to alter functional forms to allow 
for these observations or argue that they follow a completely different model. The issue of outliers 
was discussed in Chapter 9. 

Using panel data raises some additional econometric issues. Suppose you have collected two 
periods. There are at least four ways to use two periods of panel data without resorting to instru- 
mental variables. You can pool the two years in a standard OLS analysis, as discussed in Chapter 13. 
Although this might increase the sample size relative to a single cross section, it does not control for 
time-constant unobservables. In addition, the errors in such an equation are almost always serially 
correlated because of an unobserved effect. Random effects estimation corrects the serial correlation 
problem and produces asymptotically efficient estimators, provided the unobserved effect has zero 
mean given values of the explanatory variables in all time periods. 

Another possibility is to include a lagged dependent variable in the equation for the second year. 
In Chapter 9, we presented this as a way to at least mitigate the omitted variables problem, as we are 
in any event holding fixed the initial outcome of the dependent variable. This often leads to similar 
results as differencing the data, as we covered in Chapter 13. 

With more years of panel data, we have the same options, plus an additional choice. We can use 
the fixed effects transformation to eliminate the unobserved effect. (With two years of data, this is 
the same as differencing.) In Chapter 15, we showed how instrumental variables techniques can be 
combined with panel data transformations to relax exogeneity assumptions even more. As a rule, it 
is a good idea to apply several reasonable econometric methods and compare the results. This often 
allows us to determine which of our assumptions are likely to be false. 

Even if you are very careful in devising your topic, postulating your model, collecting your 
data, and carrying out the econometrics, it is quite possible that you will obtain puzzling results— 
at least some of the time. When that happens, the natural inclination is to try different models, 
different estimation techniques, or perhaps different subsets of data until the results correspond 
more closely to what was expected. Virtually all applied researchers search over various models 
before finding the “best” model. Unfortunately, this practice of data mining violates the assump- 
tions we have made in our econometric analysis. The results on unbiasedness of OLS and other 
estimators, as well as the t and F distributions we derived for hypothesis testing, assume that we
observe a sample following the population model and we estimate that model once. Estimating 
models that are variants of our original model violates that assumption because we are using the 
same set of data in a specification search. In effect, we use the outcome of tests by using the
data to respecify our model. The estimates and tests from different model specifications are not
independent of one another. 

Some specification searches have been programmed into standard software packages. A popular 
one is known as stepwise regression, where different combinations of explanatory variables are 
used in multiple regression analysis in an attempt to come up with the best model. There are vari- 
ous ways that stepwise regression can be used, and we have no intention of reviewing them here. 
The general idea is either to start with a large model and keep variables whose p-values are below 
a certain significance level or to start with a simple model and add variables that have significant 
p-values. Sometimes, groups of variables are tested with an F test. Unfortunately, the final model often 
depends on the order in which variables were dropped or added. [For more on stepwise regression, see 
Draper and Smith (1981).] In addition, this is a severe form of data mining, and it is difficult to interpret
t and F statistics in the final model. One might argue that stepwise regression simply automates
what researchers do anyway in searching over various models. However, in most applications, one or 
two explanatory variables are of primary interest, and then the goal is to see how robust the coefficients 
on those variables are to either adding or dropping other variables, or to changing functional form. 

In principle, it is possible to incorporate the effects of data mining into our statistical inference; 
in practice, this is very difficult and is rarely done, especially in sophisticated empirical work. [See 
Leamer (1983) for an engaging discussion of this problem.] But we can try to minimize data mining 
by not searching over numerous models or estimation methods until a significant result is found and 
then reporting only that result. If a variable is statistically significant in only a small fraction of the 
models estimated, it is quite likely that the variable has no effect in the population. 
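
A small simulation can illustrate how searching over specifications produces spurious significance. The sketch below is our own construction, not a standard procedure: in each replication the dependent variable and 20 candidate regressors are independent noise, each candidate is tried in a simple regression, and we record whether at least one absolute t statistic exceeds 2.01 (approximately the 5% two-sided critical value with 48 degrees of freedom). The sample size, number of candidates, number of replications, and seed are all arbitrary choices.

```python
import math
import random

def slope_t_stat(x, y):
    """t statistic for the slope in a simple regression of y on x (with intercept)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar
    ssr = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    se = math.sqrt(ssr / (n - 2) / sxx)   # usual OLS standard error of the slope
    return b1 / se

random.seed(12345)
n, k, reps = 50, 20, 200   # sample size, candidate regressors, replications (all assumed)
hits = 0
for _ in range(reps):
    y = [random.gauss(0, 1) for _ in range(n)]
    # Every candidate regressor is pure noise, unrelated to y by construction.
    found = any(
        abs(slope_t_stat([random.gauss(0, 1) for _ in range(n)], y)) > 2.01
        for _ in range(k)
    )
    hits += found

share = hits / reps
print(f"share of samples with at least one 'significant' regressor: {share:.2f}")
```

With 20 independent tests at the 5% level, the chance of at least one rejection is about 1 - 0.95^20, or roughly 0.64, even though no variable has any effect in the population. Reporting only the "significant" regression from such a search would be badly misleading.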


19-5 Writing an Empirical Paper 


Writing a paper that uses econometric analysis is very challenging, but it can also be rewarding. 
A successful paper combines a careful, convincing data analysis with good explanations and expo- 
sition. Therefore, you must have a good grasp of your topic, good understanding of econometric 
methods, and solid writing skills. Do not be discouraged if you find writing an empirical paper 
difficult; most professional researchers have spent many years learning how to craft an empirical 
analysis and to write the results in a convincing form. 

While writing styles vary, many papers follow the same general outline. The following para- 
graphs include ideas for section headings and explanations about what each section should contain. 
These are only suggestions and hardly need to be strictly followed. In the final paper, each section 
would be given a number, usually starting with one for the introduction. 


19-5a Introduction 


The introduction states the basic objectives of the study and explains why it is important. It gener- 
ally entails a review of the literature, indicating what has been done and how previous work can be 
improved upon. (As discussed in Section 19-2, an extensive literature review can be put in a separate 
section.) Presenting simple statistics or graphs that reveal a seemingly paradoxical relationship is a 
useful way to introduce the paper’s topic. For example, suppose that you are writing a paper about 
factors affecting fertility in a developing country, with the focus on education levels of women. An 
appealing way to introduce the topic would be to produce a table or a graph showing that fertility has 
been falling (say) over time and a brief explanation of how you hope to examine the factors contrib- 
uting to the decline. At this point, you may already know that, ceteris paribus, more highly educated 
women have fewer children and that average education levels have risen over time. 

Most researchers like to summarize the findings of their paper in the introduction. This can be a 
useful device for grabbing the reader’s attention. For example, you might state that your best estimate
of the effect of missing 10 hours of lecture during a 30-hour term is about one-half a grade point. But
the summary should not be too involved because neither the methods nor the data used to obtain the 
estimates have yet been introduced. 


19-5b Conceptual (or Theoretical) Framework 


In this section, you describe the general approach to answering the question you have posed. It can be 
formal economic theory, but in many cases, it is an intuitive discussion about what conceptual prob- 
lems arise in answering your question. 

Suppose you are studying the effects of economic opportunities and severity of punishment on 
criminal behavior. One approach to explaining participation in crime is to specify a utility maximiza- 
tion problem where the individual chooses the amount of time spent in legal and illegal activities, 
given wage rates in both kinds of activities, as well as variables measuring probability and severity of 
punishment for criminal activity. The usefulness of such an exercise is that it suggests which variables 
should be included in the empirical analysis; it gives guidance (but rarely specifics) as to how the 
variables should appear in the econometric model. 

Often, there is no need to write down an economic theory. For econometric policy analysis, common
sense usually suffices for specifying a model. For example, suppose you are interested in estimating
the effects of participation in Aid to Families with Dependent Children (AFDC) on child
performance in school. AFDC provides supplemental income, but participation also makes
it easier to receive Medicaid and other benefits. The hard part of such an analysis is deciding on the 
set of variables that should be controlled for. In this example, we could control for family income 
(including AFDC and any other welfare income), mother’s education, whether the family lives in an 
urban area, and other variables. Then, the inclusion of an AFDC participation indicator (hopefully) 
measures the nonincome benefits of AFDC participation. A discussion of which factors should be 
controlled for and the mechanisms through which AFDC participation might improve school performance
substitutes for formal economic theory.


19-5c Econometric Models and Estimation Methods 


It is very useful to have a section that contains a few equations of the sort you estimate and present in 
the results section of the paper. This allows you to fix ideas about what the key explanatory variable 
is and what other factors you will control for. Writing equations containing error terms allows you to 
discuss whether OLS is a suitable estimation method. 

The distinction between a model and an estimation method should be made in this section. 
A model represents a population relationship (broadly defined to allow for time series equations). 
For example, we should write 


$$\mathit{colGPA} = \beta_0 + \beta_1\,\mathit{alcohol} + \beta_2\,\mathit{hsGPA} + \beta_3\,\mathit{SAT} + \beta_4\,\mathit{female} + u$$ [19.1]


to describe the relationship between college GPA and alcohol consumption, with some other controls 
in the equation. Presumably, this equation represents a population, such as all undergraduates at a 
particular university. There are no “hats” (^) on the $\beta_j$ or on colGPA because this is a model, not an
estimated equation. We do not put in numbers for the $\beta_j$ because we do not know (and never will
know) these numbers. Later, we will estimate them. In this section, do not anticipate the presentation 
of your empirical results. In other words, do not start with a general model and then say that you omit- 
ted certain variables because they turned out to be insignificant. Such discussions should be left for 
the results section. 

A time series model to relate city-level car thefts to the unemployment rate and conviction rates 
could look like 


$$\mathit{thefts}_t = \beta_0 + \beta_1\,\mathit{unem}_t + \beta_2\,\mathit{unem}_{t-1} + \beta_3\,\mathit{cars}_t + \beta_4\,\mathit{convrate}_t + \beta_5\,\mathit{convrate}_{t-1} + u_t$$ [19.2]

where the t subscript is useful for emphasizing any dynamics in the equation (in this case, allowing
for unemployment and the automobile theft conviction rate to have lagged effects). 

After specifying a model or models, it is appropriate to discuss estimation methods. In most 
cases, this will be OLS, but, for example, in a time series equation, you might use feasible GLS to 
do a serial correlation correction (as in Chapter 12). However, the method for estimating a model is 
quite distinct from the model itself. It is not meaningful, for instance, to talk about “an OLS model.” 
Ordinary least squares is a method of estimation, and so are weighted least squares, Cochrane-Orcutt, 
and so on. There are usually several ways to estimate any model. You should explain why the method 
you are choosing is warranted. 

Any assumptions that are used in obtaining an estimable econometric model from an underly- 
ing economic model should be clearly discussed. For example, in the quality of high school example 
mentioned in Section 19-1, the issue of how to measure school quality is central to the analysis. 
Should it be based on average SAT scores, percentage of graduates attending college, student- 
teacher ratios, average education level of teachers, some combination of these, or possibly other 
measures? 

We always have to make assumptions about functional form whether or not a theoretical model 
has been presented. As you know, constant elasticity and constant semi-elasticity models are attrac- 
tive because the coefficients are easy to interpret (as percentage effects). There are no hard rules on 
how to choose functional form, but the guidelines discussed in Section 6-2 seem to work well in prac- 
tice. You do not need an extensive discussion of functional form, but it is useful to mention whether 
you will be estimating elasticities or a semi-elasticity. For example, if you are estimating the effect 
of some variable on wage or salary, the dependent variable will almost surely be in logarithmic form, 
and you might as well include this in any equations from the beginning. You do not have to present 
every one, or even most, of the functional form variations that you will report later in the results 
section. 

Often, the data used in empirical economics are at the city or county level. For example, suppose 
that for the population of small to midsize cities, you wish to test the hypothesis that having a minor 
league baseball team causes a city to have a lower divorce rate. In this case, you must account for the 
fact that larger cities will have more divorces. One way to account for the size of the city is to scale 
divorces by the city or adult population. Thus, a reasonable model is 


$$\log(\mathit{div}/\mathit{pop}) = \beta_0 + \beta_1\,\mathit{mlb} + \beta_2\,\mathit{perCath} + \beta_3\log(\mathit{inc}/\mathit{pop}) + \text{other factors},$$ [19.3]


where mlb is a dummy variable equal to one if the city has a minor league baseball team and perCath
is the percentage of the population that is Catholic (so a number such as 34.6 means 34.6%). Note that
div/pop is a divorce rate, which is generally easier to interpret than the absolute number of divorces.
Another way to control for population is to estimate the model


$$\log(\mathit{div}) = \gamma_0 + \gamma_1\,\mathit{mlb} + \gamma_2\,\mathit{perCath} + \gamma_3\log(\mathit{inc}) + \gamma_4\log(\mathit{pop}) + \text{other factors}.$$ [19.4]


The parameter of interest, $\gamma_1$, when multiplied by 100, gives the percentage difference between
divorce rates, holding population, percent Catholic, income, and whatever else is in “other factors”
constant. In equation (19.3), $\beta_1$ measures the percentage effect of minor league baseball on div/pop,
which can change either because the number of divorces or the population changes. Using the fact
that log(div/pop) = log(div) - log(pop) and log(inc/pop) = log(inc) - log(pop), we can rewrite
(19.3) as


$$\log(\mathit{div}) = \beta_0 + \beta_1\,\mathit{mlb} + \beta_2\,\mathit{perCath} + \beta_3\log(\mathit{inc}) + (1 - \beta_3)\log(\mathit{pop}) + \text{other factors},$$

which shows that (19.3) is a special case of (19.4) with $\gamma_4 = (1 - \beta_3)$ and $\gamma_j = \beta_j$, j = 0, 1, 2, 3.
Alternatively, (19.4) is equivalent to adding log(pop) as an additional explanatory variable to (19.3). 
This makes it easy to test for a separate population effect on the divorce rate. 

If you are using a more advanced estimation method, such as two stage least squares, you need 
to provide some reasons for doing so. If you use 2SLS, you must provide a careful discussion on why 
your IV choices for the endogenous explanatory variable (or variables) are valid. As we mentioned in 
Chapter 15, there are two requirements for a variable to be considered a good IV. First, it must be omit- 
ted from and exogenous to the equation of interest (structural equation). This is something we must 
assume. Second, it must have some partial correlation with the endogenous explanatory variable. This 
we can test. For example, in equation (19.1), you might use a binary variable for whether a student lives 
in a dormitory (dorm) as an IV for alcohol consumption. This requires that living situation has no direct 
impact on colGPA—so that it is omitted from (19.1)—and that it is uncorrelated with unobserved fac- 
tors in u that have an effect on colGPA. We would also have to verify that dorm is partially correlated 
with alcohol by regressing alcohol on dorm, hsGPA, SAT, and female. (See Chapter 15 for details.) 

You might account for the omitted variable problem (or omitted heterogeneity) by using panel 
data. Again, this is easily described by writing an equation or two. In fact, it is useful to show how to 
difference the equations over time to remove time-constant unobservables; this gives an equation that 
can be estimated by OLS. Or, if you are using fixed effects estimation instead, you simply state so. 

As a simple example, suppose you are testing whether higher county tax rates reduce economic 
activity, as measured by per capita manufacturing output. Suppose that for the years 1982, 1987, and 
1992, the model is 


$$\log(\mathit{manuf}_{it}) = \beta_0 + \delta_1\,d87_t + \delta_2\,d92_t + \beta_1\,\mathit{tax}_{it} + \cdots + a_i + u_{it},$$


where $d87_t$ and $d92_t$ are year dummy variables and $\mathit{tax}_{it}$ is the tax rate for county i at time t (in percent
form). We would have other variables that change over time in the equation, including measures for
costs of doing business (such as average wages), measures of worker productivity (as measured by
average education), and so on. The term $a_i$ is the fixed effect, containing all factors that do not vary
over time, and $u_{it}$ is the idiosyncratic error term. To remove $a_i$, we can either difference across the
years or use time-demeaning (the fixed effects transformation).
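
To make the differencing step concrete, here is a sketch of the first-differenced equations for this three-year model, where $\Delta$ denotes the change from the previous available year (1982 to 1987, and 1987 to 1992). Only the year dummies and the tax rate are written out; the remaining time-varying controls, represented by the dots, would be differenced in the same way.

```latex
\begin{align*}
\Delta\log(\mathit{manuf}_{it}) &= \delta_1\,\Delta d87_t + \delta_2\,\Delta d92_t
    + \beta_1\,\Delta\mathit{tax}_{it} + \cdots + \Delta u_{it}, \\
t = 1987:\quad \Delta\log(\mathit{manuf}_{i,87}) &= \delta_1
    + \beta_1\,\Delta\mathit{tax}_{i,87} + \cdots + \Delta u_{i,87}, \\
t = 1992:\quad \Delta\log(\mathit{manuf}_{i,92}) &= (\delta_2 - \delta_1)
    + \beta_1\,\Delta\mathit{tax}_{i,92} + \cdots + \Delta u_{i,92}.
\end{align*}
```

The intercept $\beta_0$ and the fixed effect $a_i$ both drop out because neither changes over time. The year dummies turn into period-specific intercepts: $\delta_1$ for the first change and $\delta_2 - \delta_1$ for the second.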

Several of the previous examples serve as a reminder that using the natural log transformation 
is very common in empirical research. We discussed the advantages of doing so in Chapters 6 and 7. 
Occasionally, one is faced with an explanatory variable that varies widely over positive values and 
also can take on the value zero. For example, penbens might represent the value of pension benefits 
for a population of workers. In some pension plans, workers receive no benefits until they work a 
certain number of years, and so penbens is zero. For other workers, penbens might be substantial, 
especially if it is measured as a present value of a future benefit stream. Using penbens as an explana- 
tory variable may lead to estimates sensitive to small changes in the sample due to extreme values—a 
phenomenon we discussed in Section 9-5c. We cannot use log(penbens) because log(0) is not defined,
and if we try to create log(penbens) any software package will insert the missing data indicator 
wherever penbens = 0. But a simple modification, log(1 + penbens), is well defined even when 
penbens = 0, and the natural logarithm has its usual benefit of compressing the range of a variable 
and likely reducing the sensitivity to outliers. Such a solution might be preferred to dropping observations
with “large” values of penbens, or using some arbitrary rule to censor large values.
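
As a small sketch of this transformation, the code below uses NumPy with made-up values of penbens. Note that NumPy returns negative infinity rather than a missing-data indicator for log(0); other packages behave differently.

```python
import numpy as np

# Hypothetical pension benefit values, including workers with no benefits yet.
penbens = np.array([0.0, 0.0, 1200.0, 45000.0, 980000.0])

# log(penbens) is undefined at zero; NumPy yields -inf there.
with np.errstate(divide="ignore"):
    log_raw = np.log(penbens)

# log(1 + penbens) is defined everywhere and compresses the range.
log_shift = np.log1p(penbens)

print(log_raw)
print(log_shift)
```

The transformed variable equals zero exactly when penbens is zero, and the largest value shrinks from 980,000 to under 14.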


19-5d The Data 


You should always have a section that carefully describes the data used in the empirical analysis. 
This is particularly important if your data are nonstandard or have not been widely used by other 
researchers. Enough information should be presented so that a reader could, in principle, obtain the 
data and redo your analysis. In particular, all applicable public data sources should be included in the 
references, and short data sets can be listed in an appendix. If you used your own survey to collect the
data, a copy of the questionnaire should be presented in an appendix. 

Along with a discussion of the data sources, be sure to discuss the units of each of the variables 
(for example, is income measured in hundreds or thousands of dollars?). Including a table of variable 
definitions is very useful to the reader. The names in the table should correspond to the names used in 
describing the econometric results in the following section. 

It is also very informative to present a table of summary statistics, such as minimum and maximum 
values, means, and standard deviations for each variable. Having such a table makes it easier to inter- 
pret the coefficient estimates in the next section, and it emphasizes the units of measurement of the var- 
iables. For binary variables, the only necessary summary statistic is the fraction of ones in the sample 
(which is the same as the sample mean). For trending variables, things like means are less interesting. 
It is often useful to compute the average growth rate in a variable over the years in your sample. 
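
The summary statistics described above can be computed along the following lines. This is only a sketch: the arrays are hypothetical, the helper name summarize is ours, and the average growth rate is approximated by the mean change in the natural log.

```python
import numpy as np


def summarize(x):
    """Mean, sample standard deviation, minimum, and maximum of a variable."""
    x = np.asarray(x, dtype=float)
    return {"mean": x.mean(), "sd": x.std(ddof=1), "min": x.min(), "max": x.max()}


# Hypothetical binary variable: the mean is the fraction of ones.
union = np.array([1, 0, 0, 1, 0, 1, 1, 0])
frac_ones = union.mean()

# Hypothetical trending series observed over four consecutive years.
output = np.array([100.0, 104.0, 109.2, 113.6])
avg_growth = np.mean(np.diff(np.log(output)))   # average annual log growth

print(summarize(output))
print(f"fraction of ones: {frac_ones:.3f}, average growth: {avg_growth:.4f}")
```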

You should always clearly state how many observations you have. For time series data sets, 
identify the years that you are using in the analysis, including a description of any special periods in 
history (such as World War II). If you use a pooled cross section or a panel data set, be sure to report 
how many cross-sectional units (people, cities, and so on) you have for each year. 


19-5e Results 


The results section should include your estimates of any models formulated in the models section. 
You might start with a very simple analysis. For example, suppose that percentage of students attend- 
ing college from the graduating class (percoll) is used as a measure of the quality of the high school a 
person attended. Then, an equation to estimate is 


$$\log(wage) = \beta_0 + \beta_1 percoll + u.$$


Of course, this does not control for several other factors that may determine wages and that may be 
correlated with percoll. But a simple analysis can draw the reader into the more sophisticated analysis 
and reveal the importance of controlling for other factors. 

If only a few equations are estimated, you can present the results in equation form with standard 
errors in parentheses below estimated coefficients. If your model has several explanatory variables 
and you are presenting several variations on the general model, it is better to report the results in tabu- 
lar rather than equation form. Most of your papers should have at least one table, which should always 
include at least the R-squared and the number of observations for each equation. Other statistics, such 
as the adjusted R-squared, can also be listed. 

The most important thing is to discuss the interpretation and strength of your empirical results. 
Do the coefficients have the expected signs? Are they statistically significant? If a coefficient is sta- 
tistically significant but has a counterintuitive sign, why might this be true? It might be revealing a 
problem with the data or the econometric method (for example, OLS may be inappropriate due to 
omitted variables problems). 

Be sure to describe the magnitudes of the coefficients on the major explanatory variables. Often, 
one or two policy variables are central to the study. Their signs, magnitudes, and statistical significance 
should be treated in detail. Remember to distinguish between economic and statistical significance. If a 
t statistic is small, is it because the coefficient is practically small or because its standard error is large? 

In addition to discussing estimates from the most general model, you can provide interesting 
special cases, especially those needed to test certain multiple hypotheses. For example, in a study 
to determine wage differentials across industries, you might present the equation without the indus- 
try dummies; this allows the reader to easily test whether the industry differentials are statistically 
significant (using the R-squared form of the F test). Do not worry too much about dropping various 
variables to find the “best” combination of explanatory variables. As we mentioned earlier, this is a 
difficult and not even very well-defined task. Only if eliminating a set of variables substantially alters 
the magnitudes and/or significance of the coefficients of interest is this important. Dropping a group
of variables to simplify the model—such as quadratics or interactions—can be justified via an F test. 
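
When the unrestricted and restricted R-squareds are both reported, a reader can compute the R-squared form of the F statistic directly. The sketch below uses hypothetical inputs, and the function name f_stat_r2 is ours; here q is the number of exclusion restrictions and k_ur is the number of explanatory variables in the unrestricted model.

```python
def f_stat_r2(r2_ur, r2_r, n, k_ur, q):
    """R-squared form of the F statistic for q exclusion restrictions.

    F = [(R2_ur - R2_r) / q] / [(1 - R2_ur) / (n - k_ur - 1)]
    """
    return ((r2_ur - r2_r) / q) / ((1.0 - r2_ur) / (n - k_ur - 1))


# Hypothetical example: dropping q = 4 dummies lowers R-squared from .40 to .35.
F = f_stat_r2(r2_ur=0.40, r2_r=0.35, n=500, k_ur=10, q=4)
print(f"F = {F:.2f} with (4, 489) degrees of freedom")
```

The statistic is then compared with a critical value from the F distribution with q and n - k_ur - 1 degrees of freedom.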

If you have used at least two different methods—such as OLS and 2SLS, or levels and differenc- 
ing for a time series, or pooled OLS versus differencing with a panel data set—then you should com- 
ment on any critical differences. If OLS gives counterintuitive results, did using 2SLS or panel data 
methods improve the estimates? Or, did the opposite happen? 


19-5f Conclusions


This can be a short section that summarizes what you have learned. For example, you might want to 
present the magnitude of a coefficient that was of particular interest. The conclusion should also dis- 
cuss caveats to the conclusions drawn, and it might even suggest directions for further research. It is 
useful to imagine readers turning first to the conclusion to decide whether to read the rest of the paper. 


19-5g Style Hints 


You should give your paper a title that reflects its topic, but make sure the title is not so long as to 
be cumbersome. The title should be on a separate title page that also includes your name, affiliation, 
and—if relevant—the course number. The title page can also include a short abstract, or an abstract 
can be included on a separate page. 

Papers should be typed and double-spaced. All equations should begin on a new line, and 
they should be centered and numbered consecutively, that is, (1), (2), (3), and so on. Large graphs 
and tables may be included after the main body. In the text, refer to papers by author and date, for 
example, White (1980). The reference section at the end of the paper should be done in standard 
format. Several examples are given in the references at the back of the text. 

When you introduce an equation in the econometric models section, you should describe the 
important variables: the dependent variable and the key independent variable or variables. To focus on 
a single independent variable, you can write an equation, such as 


$$GPA = \beta_0 + \beta_1 alcohol + \mathbf{x}\boldsymbol{\delta} + u$$
or
$$\log(wage) = \beta_0 + \beta_1 educ + \mathbf{x}\boldsymbol{\delta} + u,$$


where the notation $\mathbf{x}\boldsymbol{\delta}$ is shorthand for several other explanatory variables. At this point, you need only
describe them generally; they can be described specifically in the data section in a table. For example, in 
a study of the factors affecting chief executive officer salaries, you might include a table like Table 19.1. 

A table of summary statistics, obtained from Table I in Papke and Wooldridge (1996) and similar 
to the data in 401K, might be set up as shown in Table 19.2. 

In the results section, you can write the estimates either in equation form, as we often have done, 
or in a table. Especially when several models have been estimated with different sets of explanatory 
variables, tables are very useful. If you write out the estimates as an equation, for example, 


$$\widehat{\log(salary)} = \underset{(0.93)}{2.45} + \underset{(.115)}{.236}\,\log(sales) + \underset{(.003)}{.008}\,roe + \underset{(.028)}{.061}\,ceoten$$
$$n = 204,\quad R^2 = .351,$$

be sure to state near the first equation that standard errors are in parentheses. It is acceptable to report
the t statistics for testing $H_0\colon \beta_j = 0$, or their absolute values, but it is most important to state what
you are doing.
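
A reader who is given standard errors can always recover the t statistics, which is one reason standard errors are the more informative choice. As a brief sketch, using the coefficients and standard errors from the salary equation above (the dictionary names are ours):

```python
# Coefficients and standard errors as reported in the salary equation.
coef = {"constant": 2.45, "log(sales)": 0.236, "roe": 0.008, "ceoten": 0.061}
se = {"constant": 0.93, "log(sales)": 0.115, "roe": 0.003, "ceoten": 0.028}

# t statistic for H0: beta_j = 0 is the coefficient divided by its standard error.
t_stats = {name: coef[name] / se[name] for name in coef}

for name, t in t_stats.items():
    print(f"{name:12s} t = {t:.2f}")
```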

TABLE 19.1 Variable Descriptions

salary      annual salary (including bonuses) in 1990 (in thousands)
sales       firm sales in 1990 (in millions)
roe         average return on equity, 1988-1990 (in percent)
pcsal       percentage change in salary, 1988-1990
pcroe       percentage change in roe, 1988-1990
indust      = 1 if an industrial company, 0 otherwise
finance     = 1 if a financial company, 0 otherwise
consprod    = 1 if a consumer products company, 0 otherwise
util        = 1 if a utility company, 0 otherwise
ceoten      number of years as CEO of the company


TABLE 19.2 Summary Statistics

Variable    Mean        Standard Deviation    Minimum    Maximum
prate       .869        .167                  .023       1
mrate       .746        .844                  .011       5
employ      4,621.01    16,299.64             53         443,040
age         13.14       9.63                  4          76
sole        .415        .493                  0          1

Number of observations = 3,784


If you report your results in tabular form, make sure the dependent and independent variables are 
clearly indicated. Again, state whether standard errors or t statistics are below the coefficients (with 
the former preferred). Some authors like to use asterisks to indicate statistical significance at different 
significance levels (for example, one star means significant at 5%, two stars mean significant at 10% 
but not 5%, and so on). This is not necessary if you carefully discuss the significance of the explana- 
tory variables in the text. 

A sample table of results, derived from Table II in Papke and Wooldridge (1996), is shown in 
Table 19.3. 

Your results will be easier to read and interpret if you choose the units of both your dependent 
and independent variables so that coefficients are not too large or too small. You should never report 
numbers such as 1.051e-007 or 3.524e+006 for your coefficients or standard errors; as a general
rule, avoid scientific notation. If coefficients are either extremely small or large, rescale the dependent
or independent variables, as we discussed in Chapter 6. You should limit the number of digits reported
after the decimal point so as not to convey a false sense of precision. For example, if your regression
package estimates a coefficient to be .54821059, you should report this as .548, or even .55, in
the paper. 
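
The effect of rescaling is easy to check on simulated data. In the sketch below, the variable names, the data, and the helper ols_slope are all assumptions for illustration: measuring income in thousands of dollars rather than dollars multiplies the slope by 1,000 and leaves the fit unchanged.

```python
import numpy as np


def ols_slope(x, y):
    """Slope from a simple OLS regression of y on x (with an intercept)."""
    x = x - x.mean()
    return float(x @ (y - y.mean()) / (x @ x))


# Simulated data; the coefficient on income in dollars is tiny by construction.
rng = np.random.default_rng(1)
income_dollars = rng.uniform(20_000, 120_000, size=500)
y = 3.0 + 2.0e-5 * income_dollars + rng.normal(scale=0.5, size=500)

b_dollars = ols_slope(income_dollars, y)            # prints in scientific notation
b_thousands = ols_slope(income_dollars / 1000, y)   # exactly 1,000 times larger

print(b_dollars, b_thousands)

# Limit reported digits rather than copying the package output.
reported = f"{0.54821059:.3f}"
print(reported)
```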

As a rule, the commands that your particular econometrics package uses to produce results 
should not appear in the paper; only the results are important. If some special command was used to 
carry out a certain estimation method, this can be given in an appendix. An appendix is also a good 
place to include extra results that support your analysis but are not central to it. 

TABLE 19.3 OLS Results. Dependent Variable: Participation Rate

Independent Variables    (1)         (2)         (3)
mrate                    .156        .239        .218
                         (.012)      (.042)      (.342)
mrate^2                  --          -.087       -.096
                                     (.043)      (.073)
log(emp)                 -.112       -.112       -.098
                         (.014)      (.014)      (.111)
log(emp)^2               .0057       .0057       .0052
                         (.0009)     (.0009)     (.0007)
age                      .0060       .0059       .0050
                         (.0010)     (.0010)     (.0021)
age^2                    -.00007     -.00007     -.00006
                         (.00002)    (.00002)    (.00002)
sole                     -.0001      .0008       .0006
                         (.0058)     (.0058)     (.0061)
constant                 1.213       .198        .085
                         (.051)      (.052)      (.041)
industry dummies?        no          no          yes
Observations             3,784       3,784       3,784
R-squared                .143        .152        .162

Note: The quantities in parentheses below the estimates are the standard errors.


Summary 


In this chapter, we have discussed the ingredients of a successful empirical study and have provided hints
that can improve the quality of an analysis. Ultimately, the success of any study depends crucially on the
care and effort put into it.


Key Terms 
Data Mining Online Databases Spreadsheet 
Internet Online Search Services Text Editor 
Misspecification Analysis Sensitivity Analysis Text (ASCII) File


Sample Empirical Projects 


Throughout the text, we have seen examples of econometric analysis that either came from or were moti- 
vated by published works. We hope these have given you a good idea about the scope of empirical analy- 
sis. We include the following list as additional examples of questions that others have found or are likely 
to find interesting. These are intended to stimulate your imagination; no attempt is made to fill in all the
details of specific models, data requirements, or alternative estimation methods. It should be possible to
complete these projects in one term. 
1 Do your own campus survey to answer a question of interest at your university. For example: What
is the effect of working on college GPA? You can ask students about high school GPA, college GPA, 
ACT or SAT scores, hours worked per week, participation in athletics, major, gender, race, and so on. 
Then, use these variables to create a model that explains GPA. How much of an effect, if any, does 
another hour worked per week have on GPA? One issue of concern is that hours worked might be 
endogenous: it might be correlated with unobserved factors that affect college GPA, or lower GPAs 
might cause students to work more. 

A better approach would be to collect cumulative GPA prior to the semester and then to obtain 
GPA for the most recent semester, along with amount worked during that semester, and the other vari- 
ables. Now, cumulative GPA could be used as a control (explanatory variable) in the equation. 


2 There are many variants on the preceding topic. You can study the effects of drug or alcohol usage, 
or of living in a fraternity, on grade point average. You would want to control for many family back- 
ground variables, as well as previous performance variables. 


3 Do gun control laws at the city level reduce violent crimes? Such questions can be difficult to answer 
with a single cross section because city and state laws are often endogenous. [See Kleck and Patterson 
(1993) for an example. They used cross-sectional data and instrumental variables methods, but their 
IVs are questionable.] Panel data can be very useful for inferring causality in these contexts. At a 
minimum, you could control for a previous year’s violent crime rate. 


4 Low and McPheters (1983) used city cross-sectional data on wage rates and estimates of risk of death 
for police officers, along with other controls. The idea is to determine whether police officers are 
compensated for working in cities with a higher risk of on-the-job injury or death. 


5 Do parental consent laws increase the teenage birthrate? You can use state level data for this: either a 
time series for a given state or, even better, a panel data set of states. Do the same laws reduce abortion 
rates among teenagers? The Statistical Abstract of the United States contains all kinds of state-level 
data. Levine, Trainor, and Zimmerman (1996) studied the effects of abortion funding restrictions on 
similar outcomes. Other factors, such as access to abortions, may affect teen birth and abortion rates. 

There is also recent interest in the effects of “abstinence-only” sex education curricula. One can 
again use state-level panel data, or maybe even panel data at the school district level, to determine the 
effects of abstinence-only approaches to sex education on various outcomes, including rates of sexu- 
ally transmitted diseases and teen birthrates. 


6 Do changes in traffic laws affect traffic fatalities? McCarthy (1994) contains an analysis of monthly 
time series data for the state of California. A set of dummy variables can be used to indicate the 
months in which certain laws were in effect. The file TRAFFIC2 contains the data used by McCarthy. 
An alternative is to obtain a panel data set on states in the United States, where you can exploit vari- 
ation in laws across states, as well as across time. Freeman (2007) is a good example of a state-level 
analysis, using 25 years of data that straddle changes in various state drunk driving, seat belt, and 
speed limit laws. The data can be found in the file DRIVING. 

Mullahy and Sindelar (1994) used individual-level data matched with state laws and taxes on 
alcohol to estimate the effects of laws and taxes on the probability of driving drunk. 


7 Are blacks discriminated against in the lending market? Hunter and Walker (1996) looked at this ques- 
tion; in fact, we used their data in Computer Exercises C.8 in Chapter 7 and C.2 in Chapter 17. 


8 Is there a marriage premium for professional athletes? Korenman and Neumark (1991) found a signifi- 
cant wage premium for married men after using a variety of econometric methods, but their analysis 
is limited because they cannot directly observe productivity. (Plus, Korenman and Neumark used men 
in a variety of occupations.) Professional athletes provide an interesting group in which to study the 
marriage premium because we can easily collect data on various productivity measures, in addition 
to salary. The data set NBASAL, on players in the National Basketball Association (NBA), is one 
example. For each player, we have information on points scored, rebounds, assists, playing time, and 
demographics. As in Computer Exercise C.9 in Chapter 6, we can use multiple regression analysis to 
test whether the productivity measures differ by marital status. We can also use this kind of data to test 
whether married men are paid more after we account for productivity differences. (For example, NBA 
owners may think that married men bring stability to the team, or are better for the team image.) For 
individual sports—such as golf and tennis—annual earnings directly reflect productivity. Such data, 
along with age and experience, are relatively easy to collect. 


9 Answer this question: Are cigarette smokers less productive? A variant on this is: Do workers who
smoke take more sick days (everything else being equal)? Mullahy and Portney (1990) use individual- 
level data to evaluate this question. You could use data at, say, the metropolitan level. Something like 
average productivity in manufacturing can be related to percentage of manufacturing workers who 
smoke. Other variables, such as average worker education, capital per worker, and size of the city (you 
can think of more), should be controlled for. 


10 Do minimum wages alleviate poverty? You can use state or county data to answer this question. The
idea is that the minimum wage varies across states because some states have higher minimums than 
the federal minimum. Further, there are changes over time in the nominal minimum within a state, 
some due to changes at the federal level and some because of changes at the state level. Neumark and 
Wascher (1995) used a panel data set on states to estimate the effects of the minimum wage on the 
employment rates of young workers, as well as on school enrollment rates. 


11 What factors affect student performance at public schools? It is fairly easy to get school-level or at least
district-level data in most states. Does spending per student matter? Do student-teacher ratios have any 
effects? It is difficult to estimate ceteris paribus effects because spending is related to other factors, 
such as family incomes or poverty rates. The data set MEAP93, for Michigan high schools, contains a 
measure of the poverty rates. Another possibility is to use panel data, or at least to control for a previ- 
ous year’s performance measure (such as average test score or percentage of students passing an exam). 

You can look at less obvious factors that affect student performance. For example, after control- 
ling for income, does family structure matter? Perhaps families with two parents, but only one work- 
ing for a wage, have a positive effect on performance. (There could be at least two channels: parents 
spend more time with the children, and they might also volunteer at school.) What about the effect of 
single-parent households, controlling for income and other factors? You can merge census data for one 
or two years with school district data. 

Do public schools with more charter or private schools nearby better educate their students be- 
cause of competition? There is a tricky simultaneity issue here because private schools are probably 
located in areas where the public schools are already poor. Hoxby (1994) used an instrumental vari- 
ables approach, where population proportions of various religions were IVs for the number of private 
schools. 

Rouse (1998) studied a different question: Did students who were able to attend a private school 
due to the Milwaukee voucher program perform better than those who did not? She used panel data 
and was able to control for an unobserved student effect. A subset of Rouse’s data is contained in the 
file VOUCHER. 


12 Can excess returns on a stock, or a stock index, be predicted by the lagged price/dividend ratio? Or by
lagged interest rates or weekly monetary policy? It would be interesting to pick a foreign stock index, 
or one of the less well-known U.S. indexes. Cochrane (1997) provides a nice survey of recent theories 
and empirical results for explaining excess stock returns. 


13 Is there racial discrimination in the market for baseball cards? This involves relating the prices of
baseball cards to factors that should affect their prices, such as career statistics, whether the player is 
in the Hall of Fame, and so on. Holding other factors fixed, do cards of black or Hispanic players sell 
at a discount? 


14 You can test whether the market for gambling on sports is efficient. For example, does the spread on
football or basketball games contain all usable information for picking against the spread? The data set 
PNTSPRD contains information on men’s college basketball games. The outcome variable is binary. 
Was the spread covered or not? Then, you can try to find information that was known prior to each 
game’s being played in order to predict whether the spread is covered. (Good luck!) A useful website 
that contains historical spreads and outcomes for college football and men’s basketball games is 
www.goldsheet.com. 


15 What effect, if any, does success in college athletics have on other aspects of the university (applications,
quality of students, quality of nonathletic departments)? McCormick and Tinsley (1987) looked
at the effects of athletic success at major colleges on changes in SAT scores of entering freshmen. 
Timing is important here: presumably, it is recent past success that affects current applications and 
student quality. One must control for many other factors—such as tuition and measures of school 
quality—to make the analysis convincing because, without controlling for other factors, there is a 
negative correlation between academics and athletic performance. A more recent examination of the 
link between academic and athletic performance is provided by Tucker (2004), who also looks at how 
alumni contributions are affected by athletic success. 

A variant is to match natural rivals in football or men’s basketball and to look at differences 
across schools as a function of which school won the football game or one or more basketball games. 
ATHLET1 and ATHLET2 are small data sets that could be expanded and updated. 


16 Collect murder rates for a sample of counties (say, from the FBI Uniform Crime Reports) for two
years. Make the latter year such that economic and demographic variables are easy to obtain from 
the County and City Data Book. You can obtain the total number of people on death row plus 
executions for intervening years at the county level. If the years are 1990 and 1985, you might 
estimate 


$$mrdrte_{90} = \beta_0 + \beta_1 mrdrte_{85} + \beta_2 executions + \text{other factors},$$


where interest is in the coefficient on executions. The lagged murder rate and other factors serve 
as controls. If more than two years of data are obtained, then the panel data methods in Chapters 13 
and 14 can be applied. 

Other factors may also act as a deterrent to crime. For example, Cloninger (1991) presented a 
cross-sectional analysis of the effects of lethal police response on crime rates. 

As a different twist, what factors affect crime rates on college campuses? Does the fraction of 
students living in fraternities or sororities have an effect? Does the size of the police force matter, 
or the kind of policing used? (Be careful about inferring causality here.) Does having an escort 
program help reduce crime? What about crime rates in nearby communities? Recently, colleges 
and universities have been required to report crime statistics; in previous years, reporting was 
voluntary. 


17 What factors affect manufacturing productivity at the state level? In addition to levels of capital and
worker education, you could look at degree of unionization. A panel data analysis would be most 
convincing here, using multiple years of census data, say, 1980, 1990, 2000, and 2010. Clark (1984) 
provides an analysis of how unionization affects firm performance and productivity. What other vari- 
ables might explain productivity? 

Firm-level data can be obtained from Compustat. For example, other factors being fixed, do 
changes in unionization affect stock price of a firm? 


18 Use state- or county-level data or, if possible, school district-level data to look at the factors that affect
education spending per pupil. An interesting question is: Other things being equal (such as income 
and education levels of residents), do districts with a larger percentage of elderly people spend less on 
schools? Census data can be matched with school district spending data to obtain a very large cross 
section. The U.S. Department of Education compiles such data. 
19 What are the effects of state regulations, such as motorcycle helmet laws, on motorcycle fatalities?
Or do differences in boating laws—such as minimum operating age—help to explain boating acci- 
dent rates? The U.S. Department of Transportation compiles such information. This can be merged 
with data from the Statistical Abstract of the United States. A panel data analysis seems to be war- 
ranted here. 


20 What factors affect output growth? Two factors of interest are inflation and investment [for example,
Blomström, Lipsey, and Zejan (1996)]. You might use time series data on a country you find interesting.
Or you could use a cross section of countries, as in De Long and Summers (1991). Friedman and
Kuttner (1992) found evidence that, at least in the 1980s, the spread between the commercial paper 
rate and the Treasury bill rate affects real output. 


21 What is the behavior of mergers in the U.S. economy (or some other economy)? Shughart and Tollison
(1984) characterize (the log of) annual mergers in the U.S. economy as a random walk by showing 
that the difference in logs—roughly, the growth rate—is unpredictable given past growth rates. Does 
this still hold? Does it hold across various industries? What past measures of economic activity can be 
used to forecast mergers? 


22 What factors might explain racial and gender differences in employment and wages? For example,
Holzer (1991) reviewed the evidence on the “spatial mismatch hypothesis” to explain differences in 
employment rates between blacks and whites. Korenman and Neumark (1992) examined the effects of 
childbearing on women’s wages, while Hersch and Stratton (1997) looked at the effects of household 
responsibilities on men’s and women’s wages. 


23 Obtain monthly or quarterly data on teenage employment rates, the minimum wage, and factors that
affect teen employment to estimate the effects of the minimum wage on teen employment. Solon 
(1985) used quarterly U.S. data, while Castillo-Freeman and Freeman (1992) used annual data on 
Puerto Rico. It might be informative to analyze time series data on a low-wage state in the United 
States—where changes in the minimum wage are likely to have the largest effect. 


24 At the city level, estimate a time series model for crime. An example is Cloninger and Sartorius
(1979). As a twist, you might estimate the effects of community policing or midnight basketball pro- 
grams, relatively new innovations in fighting crime. Inferring causality is tricky. Including a lagged 
dependent variable might be helpful. Because you are using time series data, you should be aware of 
the spurious regression problem. 

Grogger (1990) used data on daily homicide counts to estimate the deterrent effects of capital 
punishment. Might there be other factors—such as news on lethal response by police—that have an 
effect on daily crime counts? 


25 Are there aggregate productivity effects of computer usage? You would need to obtain time series
data, perhaps at the national level, on productivity, percentage of employees using computers, and 
other factors. What about spending (probably as a fraction of total sales) on research and devel- 
opment? What sociological factors (for example, alcohol usage or divorce rates) might affect 
productivity? 


26 What factors affect chief executive officer salaries? The files CEOSAL1 and CEOSAL2 are
data sets that have various firm performance measures as well as information such as tenure 
and education. You can certainly update these data files and look for other interesting factors. 
Rose and Shepard (1997) considered firm diversification as one important determinant of CEO 
compensation. 


27 Do differences in tax codes across states affect the amount of foreign direct investment? Hines (1996)
studied the effects of state corporate taxes, along with the ability to apply foreign tax credits, on invest- 
ment from outside the United States. 
28 What factors affect election outcomes? Does spending matter? Do votes on specific issues matter?
Does the state of the local economy matter? See, for example, Levitt (1994) and the data sets VOTE1 
and VOTE2. Fair (1996) performed a time series analysis of U.S. presidential elections. 


29 Test whether stores or restaurants practice price discrimination based on race or ethnicity. Graddy
(1997) used data on fast-food restaurants in New Jersey and Pennsylvania, along with ZIP code-level 
characteristics, to see whether prices vary by characteristics of the local population. She found that 
prices of standard items, such as sodas, increase when the fraction of black residents increases. (Her 
data are contained in the file DISCRIM.) You can collect similar data in your local area by surveying 
stores or restaurants for prices of common items and matching those with recent census data. See 
Graddy’s paper for details of her analysis. 


30 Do your own “audit” study to test for race or gender discrimination in hiring. (One such study is
described in Example C.3 of Math Refresher C.) Have pairs of equally qualified friends, say, one male 
and one female, apply for job openings in local bars or restaurants. You can provide them with phony 
résumés that give each the same experience and background, where the only difference is gender (or 
race). Then, you can keep track of who gets the interviews and job offers. Neumark (1996) described 
one such study conducted in Philadelphia. A variant would be to test whether general physical attrac- 
tiveness or a specific characteristic, such as being obese or having visible tattoos or body piercings, 
plays a role in hiring decisions. You would want to use the same gender in the matched pairs, and it 
may not be easy to get volunteers for such a study. 


31 Following Hamermesh and Parker (2005), try to establish a link between the physical appearance of
college instructors and student evaluations. This can be done on campus via a survey. Somewhat crude 
data can be obtained from websites that allow students to rank their professors and provide some 
information about appearance. Ideally, though, any evaluations of attractiveness are not done by cur- 
rent or former students, as those evaluations can be influenced by the grade received. 


Use panel data to study the effects of various economic policies on regional economic growth. Studying 
the effects of taxes and spending is natural, but other policies may be of interest. For example, Craig, 
Jackson, and Thomson (2007) study the effects of Small Business Administration loan guarantee 
programs on per capita income growth. 


Blinder and Watson (2014) have recently studied explanations for systematic differences in economic 
variables, particularly growth in real GDP, in the United States based on the political party of the sit- 
ting president. One might update the data to the most recent quarters and also study variables other 
than GDP, such as unemployment. 


A general question of interest to those who follow college sports is whether so-called “experts” pro- 
vide value in predicting team success. For example, at the beginning of the men’s college basketball 
season, various pre-season rankings are published. One that is commonly referenced is based on the 
so-called rating percentage index, or RPI. The RPI is computed throughout the season based on a 
team’s wins, losses, location of games played, and strength of schedule. A pre-season RPI is computed 
based on measures of success from previous seasons. An interesting question is whether the pre-season 
RPI helps to predict team success—as measured by, say, winning percentage or success in the NCAA 
basketball tournament—once other observed pre-season variables are accounted for, including simple 
measures of team performance in recent seasons, experience of the coach, and quality of the incom- 
ing recruiting class, to name a few. If the pre-season RPI is not helpful after the other factors are 
controlled for, then we conclude that the ranks are not useful as additional information. Conversely, 
perhaps the calculations underlying the pre-season RPI are so ingenious that the RPI is the only statis- 
tic that matters for predicting team performance. The data needed to test these hypotheses are readily 
available on the Internet for hundreds of teams over many years. 

List of Journals 


The following is a partial list of popular journals containing empirical research in business, economics, and 
other social sciences. A complete list of journals can be found on the Internet at http://www.econlit.org. 


American Economic Journal: Applied Economics 
American Economic Journal: Economic Policy 
American Economic Review 
American Journal of Agricultural Economics 
American Political Science Review 
Applied Economics 
Brookings Papers on Economic Activity 
Canadian Journal of Economics 
Demography 
Economic Development and Cultural Change 
Economic Inquiry 
Economica 
Economics of Education Review 
Economics Letters 
Education Finance and Policy 
Empirical Economics 
Federal Reserve Bulletin 
International Economic Review 
International Tax and Public Finance 
Journal of Applied Econometrics 
Journal of Business and Economic Statistics 
Journal of Development Economics 
Journal of Economic Education 
Journal of Empirical Finance 
Journal of Environmental Economics and Management 
Journal of Finance 
Journal of Health Economics 
Journal of Human Resources 
Journal of Industrial Economics 
Journal of International Economics 
Journal of Labor Economics 
Journal of Monetary Economics 
Journal of Money, Credit and Banking 
Journal of Political Economy 
Journal of Public Economics 
Journal of Quantitative Criminology 
Journal of Urban Economics 
National Bureau of Economic Research Working Papers Series 
National Tax Journal 
Public Finance Quarterly 
Quarterly Journal of Economics 
Regional Science & Urban Economics 
Review of Economic Studies 
Review of Economics and Statistics 

Data Sources 


Numerous data sources are available throughout the world. Governments of most countries compile a 
wealth of data; some general and easily accessible data sources for the United States, such as the Economic 
Report of the President, the Statistical Abstract of the United States, and the County and City Data Book, 
have already been mentioned. International financial data on many countries are published annually in 
International Financial Statistics. Various magazines, like BusinessWeek and U.S. News and World Report, 
often publish statistics—such as CEO salaries and firm performance, or ranking of academic programs— 
that are novel and can be used in an econometric analysis. 

Rather than attempting to provide a list here, we instead give some Internet addresses that are compre- 
hensive sources for economists. A very useful site for economists, called Resources for Economists on the 
Internet, is maintained by Bill Goffe at Pennsylvania State University. The address is 


http://www.rfe.org. 


This site provides links to journals, data sources, and lists of professional and academic economists. It is 
quite simple to use. 

In addition, the Journal of Applied Econometrics and the Journal of Business and Economic Statistics 
have data archives that contain data sets used in most papers published in the journals over the past several 
years. If you find a data set that interests you, this is a good way to go, as much of the cleaning and format- 
ting of the data have already been done. The downside is that some of these data sets are used in economet- 
ric analyses that are more advanced than we have learned about in this text. On the other hand, it is often 
useful to estimate simpler models using standard econometric methods for comparison. 

Many universities, such as the University of California—Berkeley, the University of Michigan, and 
the University of Maryland, maintain very extensive data sets as well as links to a variety of data sets. 
Your own library possibly contains an extensive set of links to databases in business, economics, and the 
other social sciences. The regional Federal Reserve banks, such as the one in St. Louis, manage a variety 
of data. The National Bureau of Economic Research posts data sets used by some of its researchers. State 
and federal governments now publish a wealth of data that can be accessed via the Internet. Census data 
are publicly available from the U.S. Census Bureau. (Two useful publications are the Economic Census, 
published in years ending with two and seven, and the Census of Population and Housing, published at the 
beginning of each decade.) Other agencies, such as the U.S. Department of Justice, also make data avail- 
able to the public. 


Math Refresher A 


Basic Mathematical Tools 


This Math Refresher covers some basic mathematics that are used in econometric analysis. We 
summarize various properties of the summation operator, study properties of linear and certain 
nonlinear equations, and review proportions and percentages. We also present some special 
functions that often arise in applied econometrics, including quadratic functions and the natural loga- 
rithm. The first four sections require only basic algebra skills. Section A-5 contains a brief review of 
differential calculus; although a knowledge of calculus is not necessary to understand most of the text, 
it is used in some end-of-chapter appendices and in several of the more advanced chapters in Part 3. 


A-1 The Summation Operator and Descriptive Statistics 


The summation operator is a useful shorthand for manipulating expressions involving the sums of 
many numbers, and it plays a key role in statistics and econometric analysis. If \(\{x_i : i = 1, \ldots, n\}\) 
denotes a sequence of n numbers, then we write the sum of these numbers as 

\[\sum_{i=1}^{n} x_i \equiv x_1 + x_2 + \cdots + x_n. \tag{A.1}\]


With this definition, the summation operator is easily shown to have the following properties: 


Property Sum.1: For any constant c, 

\[\sum_{i=1}^{n} c = nc. \tag{A.2}\]

Property Sum.2: For any constant c, 

\[\sum_{i=1}^{n} c x_i = c \sum_{i=1}^{n} x_i. \tag{A.3}\]

Property Sum.3: If \(\{(x_i, y_i): i = 1, 2, \ldots, n\}\) is a set of n pairs of numbers, and a and b are 
constants, then 

\[\sum_{i=1}^{n} (a x_i + b y_i) = a \sum_{i=1}^{n} x_i + b \sum_{i=1}^{n} y_i. \tag{A.4}\]


It is also important to be aware of some things that cannot be done with the summation operator. 
Let \(\{(x_i, y_i): i = 1, 2, \ldots, n\}\) again be a set of n pairs of numbers with \(y_i \neq 0\) for each i. Then, 

\[\sum_{i=1}^{n} (x_i / y_i) \neq \left(\sum_{i=1}^{n} x_i\right) \Big/ \left(\sum_{i=1}^{n} y_i\right).\]

In other words, the sum of the ratios is not the ratio of the sums. In the n = 2 case, the application of 
familiar elementary algebra also reveals this lack of equality: \(x_1/y_1 + x_2/y_2 \neq (x_1 + x_2)/(y_1 + y_2)\). 
Similarly, the sum of the squares is not the square of the sum: \(\sum_{i=1}^{n} x_i^2 \neq (\sum_{i=1}^{n} x_i)^2\), except 
in special cases. That these two quantities are not generally equal is easiest to see when 
n = 2: \(x_1^2 + x_2^2 \neq (x_1 + x_2)^2 = x_1^2 + 2x_1x_2 + x_2^2\). 
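
The three summation properties, and the two things the summation operator cannot do, can be checked numerically. The following is a minimal Python sketch; the data and the constants a, b, and c are made up for illustration, and exact fractions are used so that the comparisons are not affected by rounding.

```python
from fractions import Fraction

# Hypothetical data and constants, chosen only for illustration.
x = [Fraction(v) for v in (2, 5, 7)]
y = [Fraction(v) for v in (1, 4, 10)]
a, b, c = Fraction(3), Fraction(-2), Fraction(4)
n = len(x)

# Property Sum.1: adding the constant c a total of n times gives n*c.
sum1_holds = sum(c for _ in range(n)) == n * c
# Property Sum.2: a constant factor can be pulled outside the sum.
sum2_holds = sum(c * xi for xi in x) == c * sum(x)
# Property Sum.3: the sum of a linear combination is the
# same linear combination of the separate sums.
sum3_holds = (sum(a * xi + b * yi for xi, yi in zip(x, y))
              == a * sum(x) + b * sum(y))

# Two things that do NOT hold in general.
sum_of_ratios = sum(xi / yi for xi, yi in zip(x, y))
ratio_of_sums = sum(x) / sum(y)
sum_of_squares = sum(xi ** 2 for xi in x)
square_of_sum = sum(x) ** 2

print(sum1_holds, sum2_holds, sum3_holds)
print(sum_of_ratios, ratio_of_sums)      # the two values differ
print(sum_of_squares, square_of_sum)     # the two values differ
```

With these particular numbers the sum of the ratios is 79/20 while the ratio of the sums is 14/15, and the sum of the squares is 78 while the square of the sum is 196.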

Given n numbers \(\{x_i : i = 1, \ldots, n\}\), we compute their average or mean by adding them up and 
dividing by n: 

\[\bar{x} = (1/n) \sum_{i=1}^{n} x_i. \tag{A.5}\]

When the \(x_i\) are a sample of data on a particular variable (such as years of education), we often call 
this the sample average (or sample mean) to emphasize that it is computed from a particular set of 
data. The sample average is an example of a descriptive statistic; in this case, the statistic describes 
the central tendency of the set of points \(x_i\). 

There are some basic properties about averages that are important to understand. First, suppose 
we take each observation on x and subtract off the average: \(d_i = x_i - \bar{x}\) (the “d” here stands for deviation 
from the average). Then, the sum of these deviations is always zero: 

\[\sum_{i=1}^{n} d_i = \sum_{i=1}^{n} (x_i - \bar{x}) = \sum_{i=1}^{n} x_i - \sum_{i=1}^{n} \bar{x} = \sum_{i=1}^{n} x_i - n\bar{x} = n\bar{x} - n\bar{x} = 0.\]

We summarize this as 

\[\sum_{i=1}^{n} (x_i - \bar{x}) = 0. \tag{A.6}\]


A simple numerical example shows how this works. Suppose n = 5 and \(x_1 = 6\), \(x_2 = 1\), 
\(x_3 = -2\), \(x_4 = 0\), and \(x_5 = 5\). Then, \(\bar{x} = 2\), and the demeaned sample is {4, −1, −4, −2, 3}. Adding 
these gives zero, which is just what equation (A.6) says. 
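
The same numerical example can be reproduced in a few lines of Python. This is only a sketch using the five observations given in the text; the variable names are illustrative.

```python
# The five observations from the numerical example in the text.
x = [6, 1, -2, 0, 5]
n = len(x)

x_bar = sum(x) / n                      # sample average, equation (A.5)
deviations = [xi - x_bar for xi in x]   # d_i = x_i - x_bar

print(x_bar)
print(deviations)
print(sum(deviations))                  # the deviations add up to zero, equation (A.6)
```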

In our treatment of regression analysis in Chapter 2, we need to know some additional algebraic 
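facts involving deviations from sample averages. An important one is that the sum of squared deviations 
is the sum of the squared \(x_i\) minus n times the square of \(\bar{x}\): 

\[\sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - n(\bar{x})^2. \tag{A.7}\]

This can be shown using basic properties of the summation operator: 

\[\begin{aligned}
\sum_{i=1}^{n} (x_i - \bar{x})^2 &= \sum_{i=1}^{n} (x_i^2 - 2x_i\bar{x} + \bar{x}^2) \\
&= \sum_{i=1}^{n} x_i^2 - 2\bar{x}\sum_{i=1}^{n} x_i + n(\bar{x})^2 \\
&= \sum_{i=1}^{n} x_i^2 - 2n(\bar{x})^2 + n(\bar{x})^2 = \sum_{i=1}^{n} x_i^2 - n(\bar{x})^2.
\end{aligned}\]

Given a data set on two variables, \(\{(x_i, y_i): i = 1, 2, \ldots, n\}\), it can also be shown that 

\[\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - n(\bar{x}\cdot\bar{y}); \tag{A.8}\]

this is a generalization of equation (A.7). (There, \(y_i = x_i\) for all i.) 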
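
A quick numerical check of identity (A.7) and of its cross-product generalization can be sketched in Python. The x values are those of the earlier numerical example; the y values are hypothetical and serve only to illustrate the two-variable case.

```python
x = [6, 1, -2, 0, 5]     # values used earlier in the text
y = [3, 0, 1, -1, 7]     # hypothetical second variable, for illustration only
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Equation (A.7): sum of squared deviations, computed two ways.
ssd_direct = sum((xi - x_bar) ** 2 for xi in x)
ssd_shortcut = sum(xi ** 2 for xi in x) - n * x_bar ** 2

# Cross-product generalization, computed three ways.
cross_direct = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
cross_one_sided = sum(xi * (yi - y_bar) for xi, yi in zip(x, y))
cross_shortcut = sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar

print(ssd_direct, ssd_shortcut)
print(cross_direct, cross_one_sided, cross_shortcut)
```

For these numbers the sum of squared deviations is 46 and the sum of cross products is 31, whichever expression is used.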

The average is the measure of central tendency that we will focus on in most of this text. 
However, it is sometimes informative to use the median (or sample median) to describe the central 
value. To obtain the median of the n numbers \(\{x_1, \ldots, x_n\}\), we first order the values of the \(x_i\) from 
smallest to largest. Then, if n is odd, the sample median is the middle number of the ordered observations. 
For example, given the numbers {−4, 8, 2, 0, 21, −10, 18}, the median value is 2 (because the 
ordered sequence is {−10, −4, 0, 2, 8, 18, 21}). If we change the largest number in this list, 21, to 
twice its value, 42, the median is still 2. By contrast, the sample average would increase from 5 to 8, 
a sizable change. Generally, the median is less sensitive than the average to changes in the extreme 
values (large or small) in a list of numbers. This is why “median incomes” or “median housing 
values” are often reported, rather than averages, when summarizing income or housing values in a 
city or county. 

If n is even, there is no unique way to define the median because there are two numbers at the 
center. Usually, the median is defined to be the average of the two middle values (again, after ordering 
the numbers from smallest to largest). Using this rule, the median for the set of numbers {4, 12, 2, 6} 
would be (4 + 6)/2 = 5. 
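
The median examples can be verified with Python's standard `statistics` module, whose `median` function averages the two middle values when n is even, the same rule described in the text. A minimal sketch, with illustrative variable names:

```python
from statistics import mean, median

data = [-4, 8, 2, 0, 21, -10, 18]
# Same list with the largest value, 21, doubled to 42.
altered = [42 if v == 21 else v for v in data]

print(median(data), mean(data))         # median and average of the original list
print(median(altered), mean(altered))   # the median is unchanged, the average rises

# Even number of observations: average of the two middle values, 4 and 6.
print(median([4, 12, 2, 6]))
```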


A-2 Properties of Linear Functions 


Linear functions play an important role in econometrics because they are simple to interpret and 
manipulate. If x and y are two variables related by 


\[y = \beta_0 + \beta_1 x, \tag{A.9}\]


then we say that y is a linear function of x, and \(\beta_0\) and \(\beta_1\) are two parameters (numbers) describing 
this relationship. The intercept is \(\beta_0\), and the slope is \(\beta_1\). 
The defining feature of a linear function is that the change in y is always \(\beta_1\) times the change in x: 

\[\Delta y = \beta_1 \Delta x, \tag{A.10}\]

where \(\Delta\) denotes “change.” In other words, the marginal effect of x on y is constant and equal 
to \(\beta_1\). 

EXAMPLE A.1 Linear Housing Expenditure Function 
Suppose that the relationship between monthly housing expenditure and monthly income is 
housing = 164 + .27 income. [A.11] 


Then, for each additional dollar of income, 27 cents is spent on housing. If family income increases 
by $200, then housing expenditure increases by (.27)200 = $54. This function is graphed in 
Figure A.1. 

According to equation (A.11), a family with no income spends $164 on housing, which of course 
cannot be literally true. For low levels of income, this linear function would not describe the relation- 
ship between housing and income very well, which is why we will eventually have to use other types 
of functions to describe such relationships. 

In (A.11), the marginal propensity to consume (MPC) housing out of income is .27. This is dif- 
ferent from the average propensity to consume (APC), which is 

\[\frac{\text{housing}}{\text{income}} = 164/\text{income} + .27.\]

The APC is not constant; it is always larger than the MPC, and it gets closer to the MPC as income 
increases. 
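
The difference between the marginal and the average propensity to consume can be seen by evaluating equation (A.11) at a few income levels. The sketch below is illustrative only; the income levels and the function names are arbitrary choices, not part of the example.

```python
MPC = 0.27          # slope of equation (A.11)
INTERCEPT = 164     # intercept of equation (A.11)

def housing(income):
    """Monthly housing expenditure implied by equation (A.11)."""
    return INTERCEPT + MPC * income

def apc(income):
    """Average propensity to consume housing: housing / income (income > 0)."""
    return housing(income) / income

# A $200 rise in income raises housing expenditure by .27 * 200 = $54,
# whatever the starting income.
change = housing(1_200) - housing(1_000)
print(round(change, 2))

# The APC exceeds the MPC and approaches it as income grows.
for income in (500, 1_000, 5_000, 50_000):
    print(income, round(apc(income), 4))
```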

Linear functions are easily defined for more than two variables. Suppose that y is related to two 
variables, \(x_1\) and \(x_2\), in the general form 

\[y = \beta_0 + \beta_1 x_1 + \beta_2 x_2. \tag{A.12}\]

FIGURE A.1 Graph of housing = 164 + .27 income. 
[Figure: housing (vertical axis) plotted against income (horizontal axis); the slope of the line is Δhousing/Δincome = .27.] 

It is rather difficult to envision this function because its graph is three-dimensional. Nevertheless, \(\beta_0\) 
is still the intercept (the value of y when \(x_1 = 0\) and \(x_2 = 0\)), and \(\beta_1\) and \(\beta_2\) measure particular slopes. 
From (A.12), the change in y, for given changes in \(x_1\) and \(x_2\), is 

\[\Delta y = \beta_1 \Delta x_1 + \beta_2 \Delta x_2. \tag{A.13}\]

If \(x_2\) does not change, that is, \(\Delta x_2 = 0\), then we have 

\[\Delta y = \beta_1 \Delta x_1 \text{ if } \Delta x_2 = 0,\]

so that \(\beta_1\) is the slope of the relationship in the direction of \(x_1\): 

\[\beta_1 = \frac{\Delta y}{\Delta x_1} \text{ if } \Delta x_2 = 0.\]

Because it measures how y changes with \(x_1\), holding \(x_2\) fixed, \(\beta_1\) is often called the partial effect of \(x_1\) 
on y. Because the partial effect involves holding other factors fixed, it is closely linked to the notion of 
ceteris paribus. The parameter \(\beta_2\) has a similar interpretation: \(\beta_2 = \Delta y/\Delta x_2\) if \(\Delta x_1 = 0\), so that \(\beta_2\) is 
the partial effect of \(x_2\) on y. 


EXAMPLE A.2 Demand for Compact Discs 


For college students, suppose that the monthly quantity demanded of compact discs is related to the 
price of compact discs and monthly discretionary income by 


quantity = 120 − 9.8 price + .03 income, 


where price is dollars per disc and income is measured in dollars. The demand curve is the relationship 
between quantity and price, holding income (and other factors) fixed. This is graphed in two dimensions in 
Figure A.2 at an income level of $900. The slope of the demand curve, −9.8, is the partial effect of price 
on quantity: holding income fixed, if the price of compact discs increases by one dollar, then the quantity 
demanded falls by 9.8. (We abstract from the fact that CDs can only be purchased in discrete units.) An 
increase in income simply shifts the demand curve up (changes the intercept), but the slope remains the same. 
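
The partial effect of price, and the way income shifts the intercept of the demand curve, can be illustrated with a short Python sketch. The second income level ($1,200) and the prices used are arbitrary values picked for illustration.

```python
def quantity(price, income):
    """Monthly quantity of CDs demanded, from the demand equation in the example."""
    return 120 - 9.8 * price + 0.03 * income

# Partial effect of price: raise price by one dollar, holding income fixed.
effect_at_900 = quantity(11, 900) - quantity(10, 900)
effect_at_1200 = quantity(11, 1_200) - quantity(10, 1_200)

# Income only shifts the intercept of the demand curve (quantity at price = 0).
intercept_at_900 = quantity(0, 900)       # 120 + .03(900)
intercept_at_1200 = quantity(0, 1_200)    # 120 + .03(1,200)

print(round(effect_at_900, 2), round(effect_at_1200, 2))
print(round(intercept_at_900, 2), round(intercept_at_1200, 2))
```

The slope is the same at both income levels, while the intercept moves from 147 to 156.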


FIGURE A.2 Graph of quantity = 120 − 9.8 price + .03 income, with income fixed at $900. 

A-3 Proportions and Percentages 


Proportions and percentages play such an important role in applied economics that it is necessary to 
become very comfortable in working with them. Many quantities reported in the popular press are 
in the form of percentages; a few examples are interest rates, unemployment rates, and high school 
graduation rates. 

An important skill is being able to convert proportions to percentages and vice versa. A percent- 
age is easily obtained by multiplying a proportion by 100. For example, if the proportion of adults 
in a county with a high school degree is .82, then we say that 82% (82 percent) of adults have a high 
school degree. Another way to think of percentages and proportions is that a proportion is the deci- 
mal form of a percentage. For example, if the marginal tax rate for a family earning $30,000 per year 
is reported as 28%, then the proportion of the next dollar of income that is paid in income taxes is 
.28 (or 28¢). 

When using percentages, we often need to convert them to decimal form. For example, if a state 
sales tax is 6% and $200 is spent on a taxable item, then the sales tax paid is 200(.06) = $12. If the 
annual return on a certificate of deposit (CD) is 7.6% and we invest $3,000 in such a CD at the begin- 
ning of the year, then our interest income is 3,000(.076) = $228. As much as we would like it, the 
interest income is not obtained by multiplying 3,000 by 7.6. 

We must be wary of proportions that are sometimes incorrectly reported as percentages in 
the popular media. If we read, “The percentage of high school students who drink alcohol is .57,” 
we know that this really means 57% (not just over one-half of a percent, as the statement literally 
implies). College volleyball fans are probably familiar with press clips containing statements such as 
“Her hitting percentage was .372.” This really means that her hitting percentage was 37.2%. 

In econometrics, we are often interested in measuring the changes in various quantities. Let x 
denote some variable, such as an individual’s income, the number of crimes committed in a community, 
or the profits of a firm. Let \(x_0\) and \(x_1\) denote two values for x: \(x_0\) is the initial value, and \(x_1\) 
is the subsequent value. For example, \(x_0\) could be the annual income of an individual in 1994 and \(x_1\) 
the income of the same individual in 1995. The proportionate change in x in moving from \(x_0\) to \(x_1\), 
sometimes called the relative change, is simply 

\[(x_1 - x_0)/x_0 = \Delta x/x_0, \tag{A.14}\]

assuming, of course, that \(x_0 \neq 0\). In other words, to get the proportionate change, we simply divide 
the change in x by its initial value. This is a way of standardizing the change so that it is free of units. 
For example, if an individual’s income goes from $30,000 per year to $36,000 per year, then the proportionate 
change is 6,000/30,000 = .20. 

It is more common to state changes in terms of percentages. The percentage change in x in 
going from \(x_0\) to \(x_1\) is simply 100 times the proportionate change: 

\[\%\Delta x = 100(\Delta x/x_0); \tag{A.15}\]

the notation “\(\%\Delta x\)” is read as “the percentage change in x.” For example, when income goes from 
$30,000 to $33,750, income has increased by 12.5%; to get this, we simply multiply the proportionate 
change, .125, by 100. 

Again, we must be on guard for proportionate changes that are reported as percentage changes. 
In the previous example, for instance, reporting the percentage change in income as .125 is incorrect 
and could lead to confusion. 

When we look at changes in things like dollar amounts or population, there is no ambiguity about 
what is meant by a percentage change. By contrast, interpreting percentage change calculations can be 
tricky when the variable of interest is itself a percentage, something that happens often in economics 
and other social sciences. To illustrate, let x denote the percentage of adults in a particular city having 
a college education. Suppose the initial value is \(x_0 = 24\) (24% have a college education), and the new 
value is \(x_1 = 30\). We can compute two quantities to describe how the percentage of college-educated 
people has changed. The first is the change in x, \(\Delta x\). In this case, \(\Delta x = x_1 - x_0 = 6\): the percentage 
of people with a college education has increased by six percentage points. On the other hand, we can 
compute the percentage change in x using equation (A.15): \(\%\Delta x = 100[(30 - 24)/24] = 25\). 

In this example, the percentage point change and the percentage change are very different. The 
percentage point change is just the change in the percentages. The percentage change is the change 
relative to the initial value. Generally, we must pay close attention to which number is being com- 
puted. The careful researcher makes this distinction perfectly clear; unfortunately, in the popular press 
as well as in academic research, the type of reported change is often unclear. 
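
The distinction between a proportionate change, a percentage change, and a percentage point change can be made concrete in code. The following Python sketch simply implements equations (A.14) and (A.15); the function names are illustrative.

```python
def proportionate_change(x0, x1):
    """(x1 - x0) / x0, as in equation (A.14); requires x0 != 0."""
    if x0 == 0:
        raise ValueError("initial value x0 must be nonzero")
    return (x1 - x0) / x0

def percentage_change(x0, x1):
    """100 times the proportionate change, as in equation (A.15)."""
    return 100 * proportionate_change(x0, x1)

# Income examples from the text.
print(proportionate_change(30_000, 36_000))   # proportionate change
print(percentage_change(30_000, 33_750))      # percentage change

# A variable that is itself a percentage: college-educated adults, 24% to 30%.
x0, x1 = 24, 30
point_change = x1 - x0                    # change in percentage points
pct_change = percentage_change(x0, x1)    # change relative to the initial value
print(point_change, pct_change)
```

For the college education example, the percentage point change is 6 while the percentage change is 25, which is why the two must be reported under different names.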


EXAMPLE A.3 Michigan Sales Tax Increase 


In March 1994, Michigan voters approved a sales tax increase from 4% to 6%. In political advertise- 
ments, supporters of the measure referred to this as a two percentage point increase, or an increase of 
two cents on the dollar. Opponents of the tax increase called it a 50% increase in the sales tax rate. 
Both claims are correct; they are simply different ways of measuring the increase in the sales tax. 
Naturally, each group reported the measure that made its position most favorable. 


For a variable such as salary, it makes no sense to talk of a “percentage point change in salary” 
because salary is not measured as a percentage. We can describe a change in salary either in dollar or 
percentage terms. 


A-4 Some Special Functions and Their Properties 


In Section A-2, we reviewed the basic properties of linear functions. We already indicated one important 
feature of functions like \(y = \beta_0 + \beta_1 x\): a one-unit change in x results in the same change in y, 
regardless of the initial value of x. As we noted earlier, this is the same as saying the marginal effect 
of x on y is constant, something that is not realistic for many economic relationships. For exam- 
ple, the important economic notion of diminishing marginal returns is not consistent with a linear 
relationship. 

In order to model a variety of economic phenomena, we need to study several nonlinear func- 
tions. A nonlinear function is characterized by the fact that the change in y for a given change in x 
depends on the starting value of x. Certain nonlinear functions appear frequently in empirical eco- 
nomics, so it is important to know how to interpret them. A complete understanding of nonlinear 
functions takes us into the realm of calculus. Here, we simply summarize the most significant aspects 
of the functions, leaving the details of some derivations for Section A-5. 


A-4a Quadratic Functions 


One simple way to capture diminishing returns is to add a quadratic term to a linear relationship. 
Consider the equation 


\[y = \beta_0 + \beta_1 x + \beta_2 x^2, \tag{A.16}\]

where \(\beta_0\), \(\beta_1\), and \(\beta_2\) are parameters. When \(\beta_1 > 0\) and \(\beta_2 < 0\), the relationship between y and x has 
the parabolic shape given in Figure A.3, where \(\beta_0 = 6\), \(\beta_1 = 8\), and \(\beta_2 = -2\). 

When \(\beta_1 > 0\) and \(\beta_2 < 0\), it can be shown (using calculus in the next section) that the maximum 
of the function occurs at the point 

\[x^* = \beta_1/(-2\beta_2). \tag{A.17}\]

For example, if \(y = 6 + 8x - 2x^2\) (so \(\beta_1 = 8\) and \(\beta_2 = -2\)), then the largest value of y occurs at 
\(x^* = 8/4 = 2\), and this value is \(6 + 8(2) - 2(2)^2 = 14\) (see Figure A.3). 

The fact that equation (A.16) implies a diminishing marginal effect of x on y is easily seen from 
its graph. Suppose we start at a low value of x and then increase x by some amount, say, c. This has a 
larger effect on y than if we start at a higher value of x and increase x by the same amount c. In fact, 
once \(x > x^*\), an increase in x actually decreases y. 

The statement that x has a diminishing marginal effect on y is the same as saying that the slope of 
the function in Figure A.3 decreases as x increases. Although this is clear from looking at the graph, 
we usually want to quantify how quickly the slope is changing. An application of calculus gives the 
approximate slope of the quadratic function as 


\[\text{slope} = \frac{\Delta y}{\Delta x} \approx \beta_1 + 2\beta_2 x, \tag{A.18}\]

for “small” changes in x. [The right-hand side of equation (A.18) is the derivative of the function in 
equation (A.16) with respect to x.] Another way to write this is 

\[\Delta y \approx (\beta_1 + 2\beta_2 x)\Delta x \text{ for “small” } \Delta x. \tag{A.19}\]

To see how well this approximation works, consider again the function \(y = 6 + 8x - 2x^2\). Then, 
according to equation (A.19), \(\Delta y \approx (8 - 4x)\Delta x\). Now, suppose we start at x = 1 and change 
x by \(\Delta x = .1\). Using (A.19), \(\Delta y \approx (8 - 4)(.1) = .4\). Of course, we can compute the change 
exactly by finding the values of y when x = 1 and x = 1.1: \(y_0 = 6 + 8(1) - 2(1)^2 = 12\) and 
\(y_1 = 6 + 8(1.1) - 2(1.1)^2 = 12.38\), so the exact change in y is .38. The approximation is pretty 
close in this case. 


FIGURE A.3 Graph of \(y = 6 + 8x - 2x^2\). [Figure: the maximum of the curve is marked at \(x^*\) on the horizontal axis.] 

Now, suppose we start at x = 1 but change x by a larger amount: \(\Delta x = .5\). Then, the approximation 
gives \(\Delta y \approx 4(.5) = 2\). The exact change is determined by finding the difference in y when x = 1 
and x = 1.5. The former value of y was 12, and the latter value is \(6 + 8(1.5) - 2(1.5)^2 = 13.5\), 
so the actual change is 1.5 (not 2). The approximation is worse in this case because the change in 
x is larger. 

For many applications, equation (A.19) can be used to compute the approximate marginal effect 
of x on y for any initial value of x and small changes. And, we can always compute the exact change 
if necessary. 
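
The comparison between the approximate and the exact change is easy to automate. The Python sketch below uses the function y = 6 + 8x − 2x² from the text; the helper names are illustrative. Expanding the square shows that, for a quadratic, the exact change equals the approximation plus β₂(Δx)², which is why the gap grows quickly with Δx.

```python
B0, B1, B2 = 6, 8, -2    # parameters of the quadratic used in the text

def y(x):
    """The quadratic function y = 6 + 8x - 2x^2."""
    return B0 + B1 * x + B2 * x ** 2

def approx_change(x, dx):
    """Approximate change in y: slope at x times dx, as in equation (A.19)."""
    return (B1 + 2 * B2 * x) * dx

def exact_change(x, dx):
    """Exact change in y when x moves from x to x + dx."""
    return y(x + dx) - y(x)

# Start at x = 1 and try a small and a larger change in x.
for dx in (0.1, 0.5):
    approx = approx_change(1, dx)
    exact = exact_change(1, dx)
    print(dx, round(approx, 4), round(exact, 4), round(exact - approx, 4))
```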


EXAMPLE A.4 A Quadratic Wage Function 
Suppose the relationship between hourly wages and years in the workforce (exper) is given by 
\[\text{wage} = 5.25 + .48\,\text{exper} - .008\,\text{exper}^2. \tag{A.20}\]


This function has the same general shape as the one in Figure A.3. Using equation (A.17), exper has a 
positive effect on wage up to the turning point, \(\text{exper}^* = .48/[2(.008)] = 30\). The first year of experience 
is worth approximately .48, or 48 cents [see (A.19) with x = 0, \(\Delta x = 1\)]. Each additional year 
of experience increases wage by less than the previous year—reflecting a diminishing marginal return 
to experience. At 30 years, an additional year of experience would actually lower the wage. This is 
not very realistic, but it is one of the consequences of using a quadratic function to capture a dimin- 
ishing marginal effect: at some point, the function must reach a maximum and curve downward. For 
practical purposes, the point at which this happens is often large enough to be inconsequential, but not 
always. 
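
The turning point and the shrinking return to experience in equation (A.20) can be tabulated with a short Python sketch. The experience levels chosen and the function names are illustrative; the approximate gain uses the slope formula (A.18), and the exact gain is the difference in the wage between consecutive years.

```python
def wage(exper):
    """Hourly wage implied by equation (A.20)."""
    return 5.25 + 0.48 * exper - 0.008 * exper ** 2

# Turning point from equation (A.17): beta_1 / (-2 * beta_2).
turning_point = 0.48 / (2 * 0.008)

def approx_gain(exper):
    """Approximate value of one more year: the slope at exper, equation (A.18)."""
    return 0.48 - 2 * 0.008 * exper

def exact_gain(exper):
    """Exact change in the wage when experience rises from exper to exper + 1."""
    return wage(exper + 1) - wage(exper)

print(round(turning_point, 6))
for exper in (0, 10, 29, 30):
    print(exper, round(approx_gain(exper), 3), round(exact_gain(exper), 3))
```

Under this equation, the exact gain from the first year is .472 rather than .48, and the year after 30 years of experience lowers the wage by .008.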


The graph of the quadratic function in (A.16) has a U-shape if \(\beta_1 < 0\) and \(\beta_2 > 0\), in which case 
there is an increasing marginal return. The minimum of the function is at the point \(-\beta_1/(2\beta_2)\). 


A-4b The Natural Logarithm 


The nonlinear function that plays the most important role in econometric analysis is the natural 
logarithm. In this text, we denote the natural logarithm, which we often refer to simply as the log 
function, as 


y = log(x). [A.21] 


You might remember learning different symbols for the natural log; ln(x) or \(\log_e(x)\) are the most 
common. These different notations are useful when logarithms with several different bases are being 
used. For our purposes, only the natural logarithm is important, and so log(x) denotes the natural 
logarithm throughout this text. This corresponds to the notational usage in many statistical packages, 
although some use ln(x) [and most calculators use ln(x)]. Economists use both log(x) and ln(x), 
which is useful to know when you are reading papers in applied economics. 

The function y = log(x) is defined only for x > 0, and it is plotted in Figure A.4. It is not 
very important to know how the values of log(x) are obtained. For our purposes, the function can 
be thought of as a black box: we can plug in any x > 0 and obtain log(x) from a calculator or a 
computer. 

Several things are apparent from Figure A.4. First, when y = log(x), the relationship between y 
and x displays diminishing marginal returns. One important difference between the log and the quadratic 
function in Figure A.3 is that when y = log(x), the effect of x on y never becomes negative: the 
slope of the function gets closer and closer to zero as x gets large, but the slope never quite reaches 
zero and certainly never becomes negative. 

FIGURE A.4 Graph of y = log(x). 

The following are also apparent from Figure A.4: 

log(x) < 0 for 0 < x < 1 
log(1) = 0 
log(x) > 0 for x > 1. 

In particular, log(x) can be positive or negative. Some useful algebraic facts about the log function are 

\[\begin{aligned}
\log(x_1 \cdot x_2) &= \log(x_1) + \log(x_2), && x_1, x_2 > 0 \\
\log(x_1/x_2) &= \log(x_1) - \log(x_2), && x_1, x_2 > 0 \\
\log(x^c) &= c\log(x), && x > 0,\ c \text{ any number.}
\end{aligned}\]

Occasionally, we will need to rely on these properties. 

The logarithm can be used for various approximations that arise in econometric applications. 
First, \(\log(1 + x) \approx x\) for \(x \approx 0\). You can try this with x = .02, .1, and .5 to see how the quality of the 
approximation deteriorates as x gets larger. Even more useful is the fact that the difference in logs can 
be used to approximate proportionate changes. Let \(x_0\) and \(x_1\) be positive values. Then, it can be shown 
(using calculus) that 

\[\log(x_1) - \log(x_0) \approx (x_1 - x_0)/x_0 = \Delta x/x_0 \tag{A.22}\]

for small changes in x. If we multiply equation (A.22) by 100 and write \(\Delta\log(x) = \log(x_1) - \log(x_0)\), then 

\[100 \cdot \Delta\log(x) \approx \%\Delta x \tag{A.23}\]


for small changes in x. The meaning of “small” depends on the context, and we will encounter several 
examples throughout this text. 
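
The suggestion to try log(1 + x) ≈ x at x = .02, .1, and .5 can be carried out with Python's standard `math` module, where `math.log` is the natural logarithm. This is a minimal sketch; the helper name is illustrative.

```python
import math

def approx_error(x):
    """Gap between x and log(1 + x); the approximation is good when this is small."""
    return x - math.log(1 + x)

# The three values of x suggested in the text.
for x in (0.02, 0.1, 0.5):
    print(x, round(math.log(1 + x), 4), round(approx_error(x), 4))
```

The error is about .0002 at x = .02, about .0047 at x = .1, and about .0945 at x = .5.
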
Why should we approximate the percentage change using (A.23) when the exact percentage 
change is so easy to compute? Momentarily, we will see why the approximation in (A.23) is useful in 
econometrics. First, let us see how good the approximation is in two examples. 

First, suppose \(x_0 = 40\) and \(x_1 = 41\). Then, the percentage change in x in moving from \(x_0\) to \(x_1\) 
is 2.5%, using \(100(x_1 - x_0)/x_0\). Now, log(41) − log(40) = .0247 (to four decimal places), which 
when multiplied by 100 is very close to 2.5. The approximation works pretty well. Now, consider 
a much bigger change: \(x_0 = 40\) and \(x_1 = 60\). The exact percentage change is 50%. However, 
log(60) − log(40) = .4055 (to four decimal places), so the approximation gives 40.55%, which is much farther off. 
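
Both comparisons can be reproduced in a few lines of Python. The sketch below places the exact percentage change next to 100 times the difference in logs, as in (A.23); the function names are illustrative.

```python
import math

def exact_pct_change(x0, x1):
    """Exact percentage change, 100 * (x1 - x0) / x0."""
    return 100 * (x1 - x0) / x0

def log_pct_change(x0, x1):
    """Approximate percentage change, 100 * [log(x1) - log(x0)], as in (A.23)."""
    return 100 * (math.log(x1) - math.log(x0))

# A small change (40 to 41) and a much bigger one (40 to 60).
for x0, x1 in ((40, 41), (40, 60)):
    print(x0, x1, exact_pct_change(x0, x1), round(log_pct_change(x0, x1), 2))
```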

Why is the approximation in (A.23) useful if it is only satisfactory for small changes? To build up 
to the answer, we first define the elasticity of y with respect to x as 


\[\frac{\Delta y}{\Delta x} \cdot \frac{x}{y} = \frac{\%\Delta y}{\%\Delta x}. \tag{A.24}\]

In other words, the elasticity of y with respect to x is the percentage change in y when x increases by 
1%. This notion should be familiar from introductory economics. 
If y is a linear function of x, \(y = \beta_0 + \beta_1 x\), then the elasticity is 

\[\frac{\Delta y}{\Delta x} \cdot \frac{x}{y} = \beta_1 \cdot \frac{x}{y} = \beta_1 \cdot \frac{x}{\beta_0 + \beta_1 x}, \tag{A.25}\]

which clearly depends on the value of x. (This is a generalization of the well-known result from basic 
demand theory: the elasticity is not constant along a straight-line demand curve.) 

Elasticities are of critical importance in many areas of applied economics, not just in demand the- 
ory. It is convenient in many situations to have constant elasticity models, and the log function allows 
us to specify such models. If we use the approximation in (A.23) for both x and y, then the elasticity 
is approximately equal to \(\Delta\log(y)/\Delta\log(x)\). Thus, a constant elasticity model is approximated by the 
equation 

\[\log(y) = \beta_0 + \beta_1\log(x), \tag{A.26}\]

and \(\beta_1\) is the elasticity of y with respect to x (assuming that x, y > 0). 


EXAMPLE A.5 Constant Elasticity Demand Function


If q is quantity demanded and p is price and these variables are related by
log(q) = 4.7 - 1.25 log(p),


then the price elasticity of demand is -1.25. Roughly, a 1% increase in price leads to a 1.25% fall in
the quantity demanded.


For our purposes, the fact that $\beta_1$ in (A.26) is only close to the elasticity is not important. In fact,
when the elasticity is defined using calculus, as in Section A-5, the definition is exact. For the
purposes of econometric analysis, (A.26) defines a constant elasticity model. Such models play a large
role in empirical economics.

Other possibilities for using the log function often arise in empirical work. Suppose that
y > 0 and


$\log(y) = \beta_0 + \beta_1 x.$ [A.27]


Then, $\Delta\log(y) = \beta_1\Delta x$, so $100 \cdot \Delta\log(y) = (100 \cdot \beta_1)\Delta x$. It follows that, when y and x are
related by equation (A.27),


$\%\Delta y \approx (100 \cdot \beta_1)\Delta x.$ [A.28]


EXAMPLE A.6 


Logarithmic Wage Equation 

Suppose that hourly wage and years of education are related by 
log(wage) = 2.78 + .094 educ. 

Then, using equation (A.28), 

$\%\Delta\mathit{wage} \approx 100(.094)\Delta\mathit{educ} = 9.4\,\Delta\mathit{educ}.$


It follows that one more year of education increases hourly wage by about 9.4%. 
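As a rough check on Example A.6, the sketch below compares the approximation from (A.28) with the percentage change obtained by exponentiating the log-wage equation. The function names are illustrative, and the exact formula 100[exp(.094 Δeduc) - 1] is a standard consequence of the log-linear form rather than something stated in this example.

```python
import math

BETA1 = 0.094  # coefficient on educ in the log(wage) equation of Example A.6

def approx_pct_change_wage(delta_educ):
    """Approximation (A.28): %change in wage is about (100 * beta1) * change in educ."""
    return 100.0 * BETA1 * delta_educ

def exact_pct_change_wage(delta_educ):
    """Exact percentage change implied by log(wage) = 2.78 + .094 educ."""
    return 100.0 * (math.exp(BETA1 * delta_educ) - 1.0)

for delta in (1, 5):
    print(delta, round(approx_pct_change_wage(delta), 2), round(exact_pct_change_wage(delta), 2))
```

For one extra year the two numbers are close; for a five-year change the approximation understates the exact percentage change noticeably.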


Generally, the quantity $\%\Delta y/\Delta x$ is called the semi-elasticity of y with respect to x. The
semi-elasticity is the percentage change in y when x increases by one unit. What we have just shown is that,
in model (A.27), the semi-elasticity is constant and equal to $100 \cdot \beta_1$. In Example A.6, we can
conveniently summarize the relationship between wages and education by saying that one more year of
education, starting from any amount of education, increases the wage by about 9.4%. This is why
such models play an important role in economics.

Another relationship of some interest in applied economics is 


$y = \beta_0 + \beta_1\log(x),$ [A.29]


where x > 0. How can we interpret this equation? If we take the change in y, we get $\Delta y = \beta_1\Delta\log(x)$,
which can be rewritten as $\Delta y = (\beta_1/100)[100 \cdot \Delta\log(x)]$. Thus, using the approximation in (A.23), we have


$\Delta y \approx (\beta_1/100)(\%\Delta x).$ [A.30]


In other words, $\beta_1/100$ is the unit change in y when x increases by 1%.


EXAMPLE A.7 Labor Supply Function

Assume that the labor supply of a worker can be described by
hours = 33 + 45.1 log(wage),
where wage is hourly wage and hours is hours worked per week. Then, from (A.30),
$\Delta\mathit{hours} \approx (45.1/100)(\%\Delta\mathit{wage}) = .451\,\%\Delta\mathit{wage}.$


In other words, a 1% increase in wage increases the weekly hours worked by about .45, or slightly less than
one-half hour. If the wage increases by 10%, then $\Delta\mathit{hours} \approx .451(10) = 4.51$, or about four and
one-half hours. We would not want to use this approximation for much larger percentage changes in wages.
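To see why the approximation should not be pushed to large wage changes, the following sketch compares (A.30) with the exact change in hours implied by the labor supply equation. The helper names are illustrative; the exact change uses the fact that the difference in logs equals the log of the ratio.

```python
import math

BETA1 = 45.1  # coefficient on log(wage) in the labor supply equation

def approx_change_hours(pct_change_wage):
    """Approximation (A.30): change in hours is about (beta1 / 100) * %change in wage."""
    return (BETA1 / 100.0) * pct_change_wage

def exact_change_hours(pct_change_wage):
    """Exact change in hours when the wage rises by the given percentage."""
    return BETA1 * math.log(1.0 + pct_change_wage / 100.0)

for pct in (1, 10, 50):
    print(pct, round(approx_change_hours(pct), 3), round(exact_change_hours(pct), 3))
```

At a 1% change the two values nearly coincide, while at 50% the approximation overstates the exact change by several hours per week.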


A-4c The Exponential Function 


Before leaving this section, we need to discuss a special function that is related to the log. As motiva- 
tion, consider equation (A.27). There, log(y) is a linear function of x. But how do we find y itself as a 
function of x? The answer is given by the exponential function. 

We will write the exponential function as y = exp(x), which is graphed in Figure A.5. From
Figure A.5, we see that exp(x) is defined for any value of x and is always greater than zero. Sometimes,
the exponential function is written as $y = e^x$, but we will not use this notation. Two important values
of the exponential function are exp(0) = 1 and exp(1) = 2.7183 (to four decimal places).

FIGURE A.5 Graph of y = exp(x).

The exponential function is the inverse of the log function in the following sense: log[exp(x)] = x
for all x, and exp[log(x)] = x for x > 0. In other words, the log “undoes” the exponential, and vice
versa. (This is why the exponential function is sometimes called the anti-log function.) In particular,
note that $\log(y) = \beta_0 + \beta_1 x$ is equivalent to


$y = \exp(\beta_0 + \beta_1 x).$


If $\beta_1 > 0$, the relationship between x and y has the same shape as in Figure A.5. Thus, if
$\log(y) = \beta_0 + \beta_1 x$ with $\beta_1 > 0$, then x has an increasing marginal effect on y. In Example A.6, this
means that another year of education leads to a larger change in wage than the previous year of
education.

Two useful facts about the exponential function are $\exp(x_1 + x_2) = \exp(x_1)\exp(x_2)$ and
$\exp[c \cdot \log(x)] = x^c$.
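The increasing marginal effect in Example A.6 can be illustrated numerically. This is a minimal sketch: `wage` simply exponentiates the log-wage equation, and `gain_from_next_year` is a hypothetical helper name.

```python
import math

def wage(educ):
    """Hourly wage implied by log(wage) = 2.78 + .094 educ."""
    return math.exp(2.78 + 0.094 * educ)

def gain_from_next_year(educ):
    """Dollar change in wage from one more year of education, starting at educ."""
    return wage(educ + 1) - wage(educ)

# Each additional year adds more dollars than the previous one.
for educ in (10, 11, 12):
    print(educ, round(gain_from_next_year(educ), 3))

# The two facts about exp() stated in the text, checked at arbitrary values.
print(math.isclose(math.exp(1.5 + 2.0), math.exp(1.5) * math.exp(2.0)))
print(math.isclose(math.exp(3.0 * math.log(2.0)), 2.0 ** 3.0))
```

The percentage gain is the same at every level of education, but because the base wage is higher, the dollar gain keeps growing.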


A-5 Differential Calculus 


In the previous section, we asserted several approximations that have foundations in calculus. Let
y = f(x) for some function f. Then, for small changes in x,


$\Delta y \approx \frac{df}{dx} \cdot \Delta x,$ [A.31]


where df/dx is the derivative of the function f, evaluated at the initial point $x_0$. We also write the
derivative as dy/dx.

For example, if y = log(x), then dy/dx = 1/x. Using (A.31), with dy/dx evaluated at $x_0$, we have
$\Delta y \approx (1/x_0)\Delta x$, or $\Delta\log(x) \approx \Delta x/x_0$, which is the approximation given in (A.22).

In applying econometrics, it helps to recall the derivatives of a handful of functions because we 
use the derivative to define the slope of a function at a given point. We can then use (A.31) to find the 
approximate change in y for small changes in x. In the linear case, the derivative is simply the slope of 
the line, as we would hope: if $y = \beta_0 + \beta_1 x$, then $dy/dx = \beta_1$.

If $y = x^c$, then $dy/dx = cx^{c-1}$. The derivative of a sum of two functions is the sum of the
derivatives: $d[f(x) + g(x)]/dx = df(x)/dx + dg(x)/dx$. The derivative of a constant times any function
is that same constant times the derivative of the function: $d[cf(x)]/dx = c[df(x)/dx]$. These simple
rules allow us to find derivatives of more complicated functions. Other rules, such as the product, 
quotient, and chain rules, will be familiar to those who have taken calculus, but we will not review 
those here. 

Some functions that are often used in economics, along with their derivatives, are 


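$y = \beta_0 + \beta_1 x + \beta_2 x^2$; $dy/dx = \beta_1 + 2\beta_2 x$

$y = \beta_0 + \beta_1/x$; $dy/dx = -\beta_1/x^2$

$y = \beta_0 + \beta_1\sqrt{x}$; $dy/dx = (\beta_1/2)x^{-1/2}$

$y = \beta_0 + \beta_1\log(x)$; $dy/dx = \beta_1/x$

$y = \exp(\beta_0 + \beta_1 x)$; $dy/dx = \beta_1\exp(\beta_0 + \beta_1 x)$.


If $\beta_0 = 0$ and $\beta_1 = 1$ in this last expression, we get dy/dx = exp(x), when y = exp(x).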

In Section A-4, we noted that equation (A.26) defines a constant elasticity model when calculus
is used. The calculus definition of elasticity is $(dy/dx) \cdot (x/y)$. It can be shown using properties of logs
and exponentials that, when (A.26) holds, $(dy/dx) \cdot (x/y) = \beta_1$.
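One way to see this result, sketched here with the same symbols as (A.26), is to exponentiate both sides, apply the two facts about the exponential function from Section A-4c, and then differentiate with the power rule:

```latex
\begin{align*}
\log(y) = \beta_0 + \beta_1 \log(x)
  &\;\Longrightarrow\; y = \exp(\beta_0)\,\exp[\beta_1 \log(x)] = \exp(\beta_0)\,x^{\beta_1}, \\
\frac{dy}{dx} &= \beta_1 \exp(\beta_0)\,x^{\beta_1 - 1} = \beta_1 \cdot \frac{y}{x}, \\
\frac{dy}{dx} \cdot \frac{x}{y} &= \beta_1 .
\end{align*}
```

The elasticity does not depend on x, which is what makes (A.26) a constant elasticity model.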

When y is a function of multiple variables, the notion of a partial derivative becomes important. 
Suppose that 


$y = f(x_1, x_2).$ [A.32]


Then, there are two partial derivatives, one with respect to $x_1$ and one with respect to $x_2$. The partial
derivative of y with respect to $x_1$, denoted here by $\partial y/\partial x_1$, is just the usual derivative of (A.32) with
respect to $x_1$, where $x_2$ is treated as a constant. Similarly, $\partial y/\partial x_2$ is just the derivative of (A.32) with
respect to $x_2$, holding $x_1$ fixed.

Partial derivatives are useful for much the same reason as ordinary derivatives. We can approxi- 
mate the change in y as 


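$\Delta y \approx \frac{\partial y}{\partial x_1} \cdot \Delta x_1$, holding $x_2$ fixed. [A.33]


Thus, calculus allows us to define partial effects in nonlinear models just as we could in linear models.
In fact, if


$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2,$

then

$\frac{\partial y}{\partial x_1} = \beta_1, \quad \frac{\partial y}{\partial x_2} = \beta_2.$


These can be recognized as the partial effects defined in Section A-2.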
A more complicated example is 


$y = 5 + 4x_1 + x_1^2 - 3x_2 + 7x_1 x_2.$ [A.34]

Now, the derivative of (A.34), with respect to $x_1$ (treating $x_2$ as a constant), is simply


$\frac{\partial y}{\partial x_1} = 4 + 2x_1 + 7x_2;$


note how this depends on $x_1$ and $x_2$. The derivative of (A.34), with respect to $x_2$, is
$\partial y/\partial x_2 = -3 + 7x_1$, so this depends only on $x_1$.


EXAMPLE A.8 Wage Function with Interaction 
A function relating wages to years of education and experience is 


$\mathit{wage} = 3.10 + .41\,\mathit{educ} + .19\,\mathit{exper} - .004\,\mathit{exper}^2 + .007\,\mathit{educ} \cdot \mathit{exper}.$ [A.35]


The partial effect of exper on wage is the partial derivative of (A.35):


$\frac{\partial\mathit{wage}}{\partial\mathit{exper}} = .19 - .008\,\mathit{exper} + .007\,\mathit{educ}.$

This is the approximate change in wage due to increasing experience by one year. Notice that
this partial effect depends on the initial level of exper and educ. For example, for a worker who
is starting with educ = 12 and exper = 5, the next year of experience increases wage by about
.19 - .008(5) + .007(12) = .234, or 23.4 cents per hour. The exact change can be calculated by
computing (A.35) at exper = 5, educ = 12 and at exper = 6, educ = 12, and then taking the difference. 
This turns out to be .23, which is very close to the approximation. 
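The comparison in Example A.8 can be verified directly. The sketch below codes equation (A.35) and its partial derivative with respect to exper; the function names are illustrative.

```python
def wage(educ, exper):
    """Wage function (A.35)."""
    return 3.10 + 0.41 * educ + 0.19 * exper - 0.004 * exper**2 + 0.007 * educ * exper

def partial_effect_exper(educ, exper):
    """Partial derivative of (A.35) with respect to exper."""
    return 0.19 - 0.008 * exper + 0.007 * educ

approx = partial_effect_exper(educ=12, exper=5)       # calculus approximation
exact = wage(educ=12, exper=6) - wage(educ=12, exper=5)  # exact one-year change
print(round(approx, 3), round(exact, 3))
```

The small gap between the two numbers comes entirely from the quadratic term in exper; the interaction term is linear in exper once educ is held fixed.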


Differential calculus plays an important role in minimizing and maximizing functions of one or
more variables. If $f(x_1, x_2, \ldots, x_k)$ is a differentiable function of k variables, then a necessary
condition for $x_1^*, x_2^*, \ldots, x_k^*$ to either minimize or maximize f over all possible values of $x_j$ is


$\frac{\partial f}{\partial x_j}(x_1^*, x_2^*, \ldots, x_k^*) = 0, \quad j = 1, 2, \ldots, k.$ [A.36]


In other words, all of the partial derivatives of f must be zero when they are evaluated at the $x_j^*$. These
are called the first order conditions for minimizing or maximizing a function. Practically, we hope to
solve equation (A.36) for the $x_j^*$. Then, we can use other criteria to determine whether we have
minimized or maximized the function. We will not need those here. [See Sydsaeter and Hammond (1995)
for a discussion of multivariable calculus and its use in optimizing functions.]
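As an illustration of solving the first order conditions, consider a hypothetical function that is not taken from the text, $f(x_1, x_2) = (x_1 - 1)^2 + (x_2 + 2)^2 + x_1 x_2$. Setting both partial derivatives to zero gives two linear equations, which the sketch below solves by Cramer's rule.

```python
def f(x1, x2):
    """Hypothetical objective function used only for illustration."""
    return (x1 - 1.0) ** 2 + (x2 + 2.0) ** 2 + x1 * x2

def gradient(x1, x2):
    """Partial derivatives of f with respect to x1 and x2."""
    return (2.0 * (x1 - 1.0) + x2, 2.0 * (x2 + 2.0) + x1)

def solve_first_order_conditions():
    """Solve 2*x1 + x2 = 2 and x1 + 2*x2 = -4, the two conditions in (A.36)."""
    a11, a12, b1 = 2.0, 1.0, 2.0
    a21, a22, b2 = 1.0, 2.0, -4.0
    det = a11 * a22 - a12 * a21
    x1_star = (b1 * a22 - a12 * b2) / det
    x2_star = (a11 * b2 - b1 * a21) / det
    return x1_star, x2_star

x1_star, x2_star = solve_first_order_conditions()
print(x1_star, x2_star, gradient(x1_star, x2_star))
```

Here the solution is $x_1^* = 8/3$ and $x_2^* = -10/3$. Evaluating f at nearby points suggests that this stationary point is a minimum, but confirming that requires the second order criteria that the text leaves to other references.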


Summary 


The math tools reviewed here are crucial for understanding regression analysis and the probability and sta- 
tistics that are covered in Appendices B and C. The material on nonlinear functions—especially quadratic, 
logarithmic, and exponential functions—is critical for understanding modern applied economic research. 
The level of comprehension required of these functions does not include a deep knowledge of calculus, 
although calculus is needed for certain derivations. 


Key Terms


Average
Ceteris Paribus
Constant Elasticity Model
Derivative
Descriptive Statistic
Diminishing Marginal Effect
Elasticity
Exponential Function
Intercept
Linear Function
Log Function
Marginal Effect
Median
Natural Logarithm
Nonlinear Function
Partial Derivative
Partial Effect
Percentage Change
Percentage Point Change
Proportionate Change
Relative Change
Semi-Elasticity
Slope
Summation Operator


Problems


1 The following table contains monthly housing expenditures for 10 families.


Family    Monthly Housing Expenditures (Dollars)
1         300
2         440
3         350
4         1,100
5         640
6         480
7         450
8         700
9         670
10        530


(i) Find the average monthly housing expenditure.
(ii) Find the median monthly housing expenditure.
(iii) If monthly housing expenditures were measured in hundreds of dollars, rather than in dollars,
what would be the average and median expenditures?
(iv) Suppose that family number 8 increases its monthly housing expenditure to $900, but the
expenditures of all other families remain the same. Compute the average and median housing
expenditures.


2 Suppose the following equation describes the relationship between the average number of classes missed 
during a semester (missed) and the distance from school (distance, measured in miles): 


missed = 3 + 0.2 distance. 


(i) Sketch this line, being sure to label the axes. How do you interpret the intercept in this equation? 

(ii) What is the average number of classes missed for someone who lives five miles away? 

(iii) What is the difference in the average number of classes missed for someone who lives 10 miles 
away and someone who lives 20 miles away? 
3 In Example A.2, quantity of compact discs was related to price and income by quantity =
120 - 9.8 price + .03 income. What is the demand for CDs if price = 15 and income = 200? What
does this suggest about using linear functions to describe demand curves?


4 Suppose the unemployment rate in the United States goes from 6.4% in one year to 5.6% in the next. 
(i) What is the percentage point decrease in the unemployment rate? 
(ii) By what percentage has the unemployment rate fallen? 


5 Suppose that the return from holding a particular firm’s stock goes from 15% in one year to 18% in the 
following year. The majority shareholder claims that “the stock return only increased by 3%,” while 
the chief executive officer claims that “the return on the firm’s stock increased by 20%.” Reconcile 
their disagreement. 


6 Suppose that Person A earns $35,000 per year and Person B earns $42,000. 
(i) Find the exact percentage by which Person B’s salary exceeds Person A’s. 
(ii) Now, use the difference in natural logs to find the approximate percentage difference. 


7 Suppose the following model describes the relationship between annual salary (salary) and the number 
of previous years of labor market experience (exper): 


log(salary) = 10.6 + .027 exper. 


(i) What is salary when exper = 0? When exper = 5? (Hint: You will need to exponentiate.) 

(ii) Use equation (A.28) to approximate the percentage increase in salary when exper increases by 
five years. 

(iii) Use the results of part (i) to compute the exact percentage difference in salary when exper = 5 
and exper = 0. Comment on how this compares with the approximation in part (ii). 


8 Let grthemp denote the proportionate growth in employment, at the county level, from 1990 to 1995, 
and let salestax denote the county sales tax rate, stated as a proportion. Interpret the intercept and slope 
in the equation 


grthemp = .043 — .78 salestax. 


9 Suppose the yield of a certain crop (in bushels per acre) is related to fertilizer amount (in pounds 
per acre) as 


$\mathit{yield} = 120 + .19\sqrt{\mathit{fertilizer}}.$


(i) Graph this relationship by plugging in several values for fertilizer. 
(ii) Describe how the shape of this relationship compares with a linear relationship between yield and 
fertilizer. 


10 Suppose that in a particular state a standardized test is given to all graduating seniors. Let score denote a 
student’s score on the test. Someone discovers that performance on the test is related to the size of the 
student’s graduating high school class. The relationship is quadratic: 


$\mathit{score} = 45.6 + .082\,\mathit{class} - .000147\,\mathit{class}^2,$


where class is the number of students in the graduating class. 

(i) How do you literally interpret the value 45.6 in the equation? By itself, is it of much interest? 
Explain. 

(ii) From the equation, what is the optimal size of the graduating class (the size that maximizes the 
test score)? (Round your answer to the nearest integer.) What is the highest achievable test score? 


(iii) Sketch a graph that illustrates your solution in part (ii).

(iv) Does it seem likely that score and class would have a deterministic relationship? That is, is it
realistic to think that once you know the size of a student’s graduating class you know, with
certainty, his or her test score? Explain.


11 Consider the line


$y = \beta_0 + \beta_1 x.$


(i) Let $(x_1, y_1)$ and $(x_2, y_2)$ be two points on the line. Show that $(\bar{x}, \bar{y})$ is also on the line, where
$\bar{x} = (x_1 + x_2)/2$ is the average of the two values and $\bar{y} = (y_1 + y_2)/2$.
(ii) Extend the result of part (i) to n points on the line, $\{(x_i, y_i): i = 1, \ldots, n\}$.


12 (i) Let $\{x_i: i = 1, 2, \ldots, n\}$ be a set of n data points, and let $\bar{x}$ be the average. Suppose that the units
i are divided into two groups of sizes $n_1$ and $n_2$, with $n_1 + n_2 = n$. Without loss of generality, order
the observations as


$\{x_1, x_2, \ldots, x_{n_1}, x_{n_1+1}, x_{n_1+2}, \ldots, x_n\}$


so that the data points for the first group appear first. Let


$\bar{x}_1 = n_1^{-1}\sum_{i=1}^{n_1} x_i, \quad \bar{x}_2 = n_2^{-1}\sum_{i=n_1+1}^{n} x_i$


be the averages for the two groups. Show that


$\bar{x} = \left(\frac{n_1}{n}\right)\bar{x}_1 + \left(\frac{n_2}{n}\right)\bar{x}_2 = w_1\bar{x}_1 + w_2\bar{x}_2,$


so that $\bar{x}$ can be expressed as a weighted average of the averages from the two subgroups.

(ii) Do the weights $w_1$ and $w_2$ in part (i) make intuitive sense? Explain.

(iii) How does the finding in part (i) extend to the case of g groups, where the group sizes are
$n_1, n_2, \ldots, n_g$?


13 (i) Let $\{x_i: i = 1, 2, \ldots, n\}$ be a set of n data points with $x_i > 0$ for all i. Is it always true that


$\frac{1}{n}\sum_{i=1}^{n}\frac{1}{x_i} = \frac{1}{\frac{1}{n}\sum_{i=1}^{n} x_i}$?


(ii) Is the equality in part (i) always true if $x_i = c$ for all i, where c > 0?
Fundamentals of Probability


This Math Refresher covers key concepts from basic probability. Appendices B and C are primarily
for review; they are not intended to replace a course in probability and statistics. However, all
of the probability and statistics concepts that we use in the text are covered in these appendices. 
Probability is of interest in its own right for students in business, economics, and other social 
sciences. For example, consider the problem of an airline trying to decide how many reservations to 
accept for a flight that has 100 available seats. If fewer than 100 people want reservations, then these 
should all be accepted. But what if more than 100 people request reservations? A safe solution is to 
accept at most 100 reservations. However, because some people book reservations and then do not 
show up for the flight, there is some chance that the plane will not be full even if 100 reservations are 
booked. This results in lost revenue to the airline. A different strategy is to book more than 100 reser- 
vations and to hope that some people do not show up, so the final number of passengers is as close to 
100 as possible. This policy runs the risk of the airline having to compensate people who are
necessarily bumped from an overbooked flight.
A natural question in this context is: Can we decide on the optimal (or best) number of
reservations the airline should make? This is a nontrivial problem. Nevertheless, given certain information
(on airline costs and how frequently people show up for reservations), we can use basic probability to
arrive at a solution.


B-1 Random Variables and Their Probability Distributions 


Suppose that we flip a coin 10 times and count the number of times the coin turns up heads. This is 
an example of an experiment. Generally, an experiment is any procedure that can, at least in theory, 
be infinitely repeated and has a well-defined set of outcomes. We could, in principle, carry out the 
coin-flipping procedure again and again. Before we flip the coin, we know that the number of heads 
appearing is an integer from 0 to 10, so the outcomes of the experiment are well defined. 

A random variable is one that takes on numerical values and has an outcome that is determined 
by an experiment. In the coin-flipping example, the number of heads appearing in 10 flips of a coin 
is an example of a random variable. Before we flip the coin 10 times, we do not know how many 
times the coin will come up heads. Once we flip the coin 10 times and count the number of heads, we
obtain the outcome of the random variable for this particular trial of the experiment. Another trial can 
produce a different outcome. 

In the airline reservation example mentioned earlier, the number of people showing up for their 
flight is a random variable: before any particular flight, we do not know how many people will show up. 

To analyze data collected in business and the social sciences, it is important to have a basic 
understanding of random variables and their properties. Following the usual conventions in probabil- 
ity and statistics throughout Appendices B and C, we denote random variables by uppercase letters, 
usually W, X, Y, and Z; particular outcomes of random variables are denoted by the corresponding 
lowercase letters, w, x, y, and z. For example, in the coin-flipping experiment, let X denote the number 
of heads appearing in 10 flips of a coin. Then, X is not associated with any particular value, but we 
know X will take on a value in the set {0, 1,2,..., 10}. A particular outcome is, say, x = 6. 

We indicate large collections of random variables by using subscripts. For example, if we record
last year’s income of 20 randomly chosen households in the United States, we might denote these
random variables by $X_1, X_2, \ldots, X_{20}$; the particular outcomes would be denoted $x_1, x_2, \ldots, x_{20}$.

As stated in the definition, random variables are always defined to take on numerical values, even 
when they describe qualitative events. For example, consider tossing a single coin, where the two 
outcomes are heads and tails. We can define a random variable as follows: X = 1 if the coin turns up 
heads, and X = 0 if the coin turns up tails. 

A random variable that can only take on the values zero and one is called a Bernoulli (or binary) 
random variable. In basic probability, it is traditional to call the event X = 1 a “success” and the 
event X = 0 a “failure.” For a particular application, the success-failure nomenclature might not 
correspond to our notion of a success or failure, but it is a useful terminology that we will adopt. 


B-1a Discrete Random Variables 


A discrete random variable is one that takes on only a finite or countably infinite number of values. 
The notion of “countably infinite” means that even though an infinite number of values can be taken 
on by a random variable, those values can be put in a one-to-one correspondence with the positive 
integers. Because the distinction between “countably infinite” and “uncountably infinite” is some- 
what subtle, we will concentrate on discrete random variables that take on only a finite number of 
values. Larsen and Marx (1986, Chapter 3) provide a detailed treatment. 

A Bernoulli random variable is the simplest example of a discrete random variable. The only 
thing we need to completely describe the behavior of a Bernoulli random variable is the probability 
that it takes on the value one. In the coin-flipping example, if the coin is “fair,” then P(X = 1) = 1/2 
(read as “the probability that X equals one is one-half”). Because probabilities must sum to one, 
P(X = 0) = 1/2, also. 

Social scientists are interested in more than flipping coins, so we must allow for more general 
situations. Again, consider the example where the airline must decide how many people to book for 
a flight with 100 available seats. This problem can be analyzed in the context of several Bernoulli 
random variables as follows: for a randomly selected customer, define a Bernoulli random variable as 
X = 1 if the person shows up for the reservation, and X = 0 if not. 

There is no reason to think that the probability of any particular customer showing up is 1/2; in
principle, the probability can be any number between 0 and 1. Call this number $\theta$, so that


$P(X = 1) = \theta$ [B.1]
$P(X = 0) = 1 - \theta.$ [B.2]

For example, if $\theta = .75$, then there is a 75% chance that a customer shows up after making a reservation
and a 25% chance that the customer does not show up. Intuitively, the value of $\theta$ is crucial in
determining the airline’s strategy for booking reservations. Methods for estimating $\theta$, given historical data on
airline reservations, are a subject of mathematical statistics, something we turn to in Math Refresher C.
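To make the role of $\theta$ concrete, the sketch below simulates show-up decisions for many customers and then recovers $\theta$ from the simulated data. Both the simulation setup and the use of the sample proportion as the estimate are assumptions made here for illustration; the formal treatment of estimation is left to Math Refresher C.

```python
import random

def simulate_show_ups(theta, n_customers, seed=0):
    """Draw independent Bernoulli(theta) outcomes: 1 = shows up, 0 = does not."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    return [1 if rng.random() < theta else 0 for _ in range(n_customers)]

def estimate_theta(outcomes):
    """Sample proportion of successes, used here as a simple estimate of theta."""
    return sum(outcomes) / len(outcomes)

outcomes = simulate_show_ups(theta=0.75, n_customers=100_000)
print(round(estimate_theta(outcomes), 3))
```

With a large number of simulated customers, the sample proportion lands close to the value of $\theta$ used to generate the data.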
More generally, any discrete random variable is completely described by listing its possible
values and the associated probability that it takes on each value. If X takes on the k possible values
$\{x_1, \ldots, x_k\}$, then the probabilities $p_1, p_2, \ldots, p_k$ are defined by


$p_j = P(X = x_j), \quad j = 1, 2, \ldots, k,$ [B.3]

where each $p_j$ is between 0 and 1 and

$p_1 + p_2 + \cdots + p_k = 1.$ [B.4]


Equation (B.3) is read as: “The probability that X takes on the value $x_j$ is equal to $p_j$.”

Equations (B.1) and (B.2) show that the probabilities of success and failure for a Bernoulli
random variable are determined entirely by the value of $\theta$. Because Bernoulli random variables are
so prevalent, we have a special notation for them: $X \sim \text{Bernoulli}(\theta)$ is read as “X has a Bernoulli
distribution with probability of success equal to $\theta$.”

The probability density function (pdf) of X summarizes the information concerning the possible 
outcomes of X and the corresponding probabilities: 


$f(x_j) = p_j, \quad j = 1, 2, \ldots, k,$ [B.5]


with f(x) = 0 for any x not equal to $x_j$ for some j. In other words, for any real number x, f(x) is the
probability that the random variable X takes on the particular value x. When dealing with more than
one random variable, it is sometimes useful to subscript the pdf in question: $f_X$ is the pdf of X, $f_Y$ is the
pdf of Y, and so on.

Given the pdf of any discrete random variable, it is simple to compute the probability of any 
event involving that random variable. For example, suppose that X is the number of free throws made 
by a basketball player out of two attempts, so that X can take on the three values {0, 1, 2}. Assume 
that the pdf of X is given by 


f(0) = .20, f(1) = .44, and f(2) = .36. 


The three probabilities sum to one, as they must. Using this pdf, we can calculate the probability that 
the player makes at least one free throw: $P(X \geq 1) = P(X = 1) + P(X = 2) = .44 + .36 = .80$.
The pdf of X is shown in Figure B.1. 
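The free throw calculation can be written as a small Python sketch in which the pdf is stored as a dictionary. The helper `prob` is an illustrative name; it adds up f(x) over the values of x that satisfy an event.

```python
# pdf of X, the number of free throws made out of two attempts
pdf = {0: 0.20, 1: 0.44, 2: 0.36}

def prob(event):
    """Probability of an event, found by summing the pdf over the values in the event."""
    return sum(p for x, p in pdf.items() if event(x))

total = sum(pdf.values())                 # the probabilities must sum to one
at_least_one = prob(lambda x: x >= 1)     # P(X >= 1) = f(1) + f(2)
print(round(total, 2), round(at_least_one, 2))
```

The same helper handles any other event involving X, for example `prob(lambda x: x <= 1)` for the probability of making at most one free throw.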


FIGURE B.1 The pdf of the number of free throws made out of two attempts. 


B-1b Continuous Random Variables


A variable X is a continuous random variable if it takes on any real value with zero probability. 
This definition is somewhat counterintuitive because in any application we eventually observe some 
outcome for a random variable. The idea is that a continuous random variable X can take on so many 
possible values that we cannot count them or match them up with the positive integers, so logical con- 
sistency dictates that X can take on each value with probability zero. While measurements are always 
discrete in practice, random variables that take on numerous values are best treated as continuous. For 
example, the most refined measure of the price of a good is in terms of cents. We can imagine listing 
all possible values of price in order (even though the list may continue indefinitely), which technically 
makes price a discrete random variable. However, there are so many possible values of price that 
using the mechanics of discrete random variables is not feasible. 

We can define a probability density function for continuous random variables, and, as with 
discrete random variables, the pdf provides information on the likely outcomes of the random 
variable. However, because it makes no sense to discuss the probability that a continuous random 
variable takes on a particular value, we use the pdf of a continuous random variable only to compute 
events involving a range of values. For example, if a and b are constants where a < b, the probability 
that X lies between the numbers a and b, $P(a \leq X \leq b)$, is the area under the pdf between points a
and b, as shown in Figure B.2. If you are familiar with calculus, you recognize this as the integral of 
the function f between the points a and b. The entire area under the pdf must always equal one. 

When computing probabilities for continuous random variables, it is easiest to work with the 
cumulative distribution function (cdf). If X is any random variable, then its cdf is defined for any 
real number x by 


$F(x) = P(X \leq x).$ [B.6]
For discrete random variables, (B.6) is obtained by summing the pdf over all values $x_j$ such that
$x_j \leq x$. For a continuous random variable, F(x) is the area under the pdf, f, to the left of the point x.
Because F(x) is simply a probability, it is always between 0 and 1. Further, if $x_1 < x_2$, then
$P(X \leq x_1) \leq P(X \leq x_2)$, that is, $F(x_1) \leq F(x_2)$. This means that a cdf is an increasing (or at least a
nondecreasing) function of x.

Two important properties of cdfs that are useful for computing probabilities are the following: 


For any number c, $P(X > c) = 1 - F(c)$. [B.7]
For any numbers a < b, $P(a < X \leq b) = F(b) - F(a)$. [B.8]

In our study of econometrics, we will use cdfs to compute probabilities only for continuous random
variables, in which case it does not matter whether inequalities in probability statements are strict or
not. That is, for a continuous random variable X,


$P(X \geq c) = P(X > c),$ [B.9]
and
$P(a < X < b) = P(a \leq X \leq b) = P(a \leq X < b) = P(a < X \leq b).$ [B.10]


Combined with (B.7) and (B.8), equations (B.9) and (B.10) greatly expand the probability calcula- 
tions that can be done using continuous cdfs. 

Cumulative distribution functions have been tabulated for all of the important continuous distri- 
butions in probability and statistics. The most well known of these is the normal distribution, which 
we cover along with some related distributions in Section B-5. 
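Properties (B.7) and (B.8) can be illustrated with a cdf computed in code instead of read from a table. The sketch below uses the standard normal cdf as an assumed example (the normal distribution itself is covered in Section B-5) and builds it from `math.erf` in the Python standard library.

```python
import math

def std_normal_cdf(x):
    """cdf of the standard normal distribution, F(x) = P(X <= x)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def prob_greater_than(c):
    """Property (B.7): P(X > c) = 1 - F(c)."""
    return 1.0 - std_normal_cdf(c)

def prob_between(a, b):
    """Property (B.8): P(a < X <= b) = F(b) - F(a), for a < b."""
    return std_normal_cdf(b) - std_normal_cdf(a)

print(round(prob_greater_than(0.0), 3))
print(round(prob_between(-1.96, 1.96), 3))
```

Because X is continuous here, (B.9) and (B.10) imply that the same two functions also give $P(X \geq c)$ and $P(a \leq X \leq b)$.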


B-2 Joint Distributions, Conditional Distributions, and Independence 


In economics, we are usually interested in the occurrence of events involving more than one random 
variable. For example, in the airline reservation example referred to earlier, the airline might be inter- 
ested in the probability that a person who makes a reservation shows up and is a business traveler; this 
is an example of a joint probability. Or, the airline might be interested in the following conditional 
probability: conditional on the person being a business traveler, what is the probability of his or her 
showing up? In the next two subsections, we formalize the notions of joint and conditional distribu- 
tions and the important notion of independence of random variables. 


B-2a Joint Distributions and Independence 


Let X and Y be discrete random variables. Then, (X, Y) have a joint distribution, which is fully
described by the joint probability density function of (X, Y):


$f_{X,Y}(x, y) = P(X = x, Y = y),$ [B.11]


where the right-hand side is the probability that X = x and Y = y. When X and Y are continuous, a 
joint pdf can also be defined, but we will not cover such details because joint pdfs for continuous ran- 
dom variables are not used explicitly in this text. 

In one case, it is easy to obtain the joint pdf if we are given the pdfs of X and Y. In particular, 
random variables X and Y are said to be independent if, and only if, 


$f_{X,Y}(x, y) = f_X(x)f_Y(y)$ [B.12]


for all x and y, where $f_X$ is the pdf of X and $f_Y$ is the pdf of Y. In the context of more than one random
variable, the pdfs $f_X$ and $f_Y$ are often called marginal probability density functions to distinguish them from
the joint pdf $f_{X,Y}$. This definition of independence is valid for discrete and continuous random variables.

To understand the meaning of (B.12), it is easiest to deal with the discrete case. If X and Y are
discrete, then (B.12) is the same as 


P(X = x, Y = y) = P(X = x)P(Y = y); [B.13] 


in other words, the probability that X = x and Y = y is the product of the two probabilities P(X = x) 
and P(Y = y). One implication of (B.13) is that joint probabilities are fairly easy to compute, because 
they only require knowledge of P(X = x) and P(Y = y). 

If random variables are not independent, then they are said to be dependent. 


EXAMPLE B.1 


Free Throw Shooting 


Consider a basketball player shooting two free throws. Let X be the Bernoulli random variable equal 
to one if she or he makes the first free throw, and zero otherwise. Let Y be a Bernoulli random variable 
equal to one if he or she makes the second free throw. Suppose that she or he is an 80% free throw 
shooter, so that P(X = 1) = P(Y = 1) = .8. What is the probability of the player making both free 
throws? 

If X and Y are independent, we can easily answer this question: 
P(X = 1, Y = 1) = P(X = 1)P(Y = 1) = (.8)(.8) = .64. Thus, there is a 64% chance of making 
both free throws. If the chance of making the second free throw depends on whether the first was 
made—that is, X and Y are not independent—then this simple calculation is not valid. 


Independence of random variables is a very important concept. In the next subsection, we will 
show that if X and Y are independent, then knowing the outcome of X does not change the probabilities 
of the possible outcomes of Y, and vice versa. One useful fact about independence is that if X and Y 
are independent and we define new random variables g(X) and h(Y) for any functions g and h, then 
these new random variables are also independent. 

There is no need to stop at two random variables. If \(X_1, X_2, \ldots, X_n\) are discrete random variables, 
then their joint pdf is \(f(x_1, x_2, \ldots, x_n) = P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n)\). The random variables 
\(X_1, X_2, \ldots, X_n\) are independent random variables if, and only if, their joint pdf is the product of 
the individual pdfs for any \((x_1, x_2, \ldots, x_n)\). This definition of independence also holds for continuous 
random variables. 

The notion of independence plays an important role in obtaining some of the classic distributions 
in probability and statistics. Earlier, we defined a Bernoulli random variable as a zero-one random 
variable indicating whether or not some event occurs. Often, we are interested in the number of suc- 
cesses in a sequence of independent Bernoulli trials. A standard example of independent Bernoulli 
trials is flipping a coin again and again. Because the outcome on any particular flip has nothing to do 
with the outcomes on other flips, independence is an appropriate assumption. 

Independence is often a reasonable approximation in more complicated situations. In the airline 
reservation example, suppose that the airline accepts n reservations for a particular flight. For each 
i = 1, 2, ..., n, let \(Y_i\) denote the Bernoulli random variable indicating whether customer i shows up: 
\(Y_i = 1\) if customer i appears, and \(Y_i = 0\) otherwise. Letting \(\theta\) again denote the probability of success 
(using reservation), each \(Y_i\) has a Bernoulli(\(\theta\)) distribution. As an approximation, we might assume 
that the \(Y_i\) are independent of one another, although this is not exactly true in reality: some people 
travel in groups, which means that whether or not a person shows up is not truly independent of 
whether all others show up. Modeling this kind of dependence is complex, however, so we might be 
willing to use independence as an approximation. 

The variable of primary interest is the total number of customers showing up out of the n 
reservations; call this variable X. Because each \(Y_i\) is unity when a person shows up, we can write 
\(X = Y_1 + Y_2 + \cdots + Y_n\). Now, assuming that each \(Y_i\) has probability of success \(\theta\) and that the \(Y_i\) 
are independent, X can be shown to have a binomial distribution. That is, the probability density 
function of X is 


\(f(x) = \binom{n}{x}\theta^x(1 - \theta)^{n-x}, \quad x = 0, 1, 2, \ldots, n,\) [B.14]


where \(\binom{n}{x} = \frac{n!}{x!(n - x)!}\) and for any integer n, n! (read “n factorial”) is defined as 
\(n! = n \cdot (n - 1) \cdot (n - 2) \cdots 1\). By convention, 0! = 1. When a random variable X has the pdf given in (B.14), we 
write X ~ Binomial(n, \(\theta\)). Equation (B.14) can be used to compute P(X = x) for any value of x from 
0 to n. 

If the flight has 100 available seats, the airline is interested in P(X > 100). Suppose, initially, 
that n = 120, so that the airline accepts 120 reservations, and the probability that each person shows 
up is \(\theta = .85\). Then, \(P(X > 100) = P(X = 101) + P(X = 102) + \cdots + P(X = 120)\), and each 
of the probabilities in the sum can be found from equation (B.14) with n = 120, \(\theta = .85\), and the 
appropriate value of x (101 to 120). This is a difficult hand calculation, but many statistical packages 
have commands for computing this kind of probability. In this case, the probability that more than 
100 people will show up is about .659, which is probably more risk of overbooking than the 
airline wants to tolerate. If, instead, the number of reservations is 110, the probability of more than 
100 passengers showing up is only about .024. 
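
The overbooking probabilities above are tedious by hand but short to compute. The following Python sketch evaluates (B.14) directly using only the standard library; the function names `binomial_pmf` and `prob_more_than` are illustrative choices, not part of any statistical package.

```python
from math import comb

def binomial_pmf(x, n, theta):
    """P(X = x) for X ~ Binomial(n, theta), as in equation (B.14)."""
    return comb(n, x) * theta**x * (1 - theta)**(n - x)

def prob_more_than(k, n, theta):
    """P(X > k): sum the pmf over x = k + 1, ..., n."""
    return sum(binomial_pmf(x, n, theta) for x in range(k + 1, n + 1))

p_120 = prob_more_than(100, 120, 0.85)  # airline accepts 120 reservations
p_110 = prob_more_than(100, 110, 0.85)  # airline accepts 110 reservations
print(round(p_120, 3), round(p_110, 3))
```

The two printed values should be close to the .659 and .024 reported in the text.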


B-2b Conditional Distributions 


In econometrics, we are usually interested in how one random variable, call it Y, is related to one or 
more other variables. For now, suppose that there is only one variable whose effects we are interested in, 
call it X. The most we can know about how X affects Y is contained in the conditional distribution of 
Y given X. This information is summarized by the conditional probability density function, defined by 


\(f_{Y|X}(y|x) = f_{X,Y}(x, y)/f_X(x)\) [B.15]


for all values of x such that \(f_X(x) > 0\). The interpretation of (B.15) is most easily seen when X and Y 
are discrete. Then, 


\(f_{Y|X}(y|x) = P(Y = y|X = x),\) [B.16]


where the right-hand side is read as “the probability that Y = y given that X = x.” When Y is continuous, 
\(f_{Y|X}(y|x)\) is not interpretable directly as a probability, for the reasons discussed earlier, but conditional 
probabilities are found by computing areas under the conditional pdf. 

An important feature of conditional distributions is that, if X and Y are independent random variables, 
knowledge of the value taken on by X tells us nothing about the probability that Y takes on various 
values (and vice versa). That is, \(f_{Y|X}(y|x) = f_Y(y)\), and \(f_{X|Y}(x|y) = f_X(x)\). 


EXAMPLE B.2 Free Throw Shooting 


Consider again the basketball-shooting example, where two free throws are to be attempted. Assume 
that the conditional density is 


\(f_{Y|X}(1|1) = .85\), \(f_{Y|X}(0|1) = .15\) 
\(f_{Y|X}(1|0) = .70\), \(f_{Y|X}(0|0) = .30\). 


This means that the probability of the player making the second free throw depends on whether the 
first free throw was made: if the first free throw is made, the chance of making the second is .85; if the 
first free throw is missed, the chance of making the second is .70. This implies that X and Y are not 
independent; they are dependent. 

We can still compute P(X = 1, Y = 1) provided we know P(X = 1). Assume that the probability 
of making the first free throw is .8, that is, P(X = 1) = .8. Then, from (B.15), we have 


P(X = 1, Y = 1) = P(Y = 1|X = 1) · P(X = 1) = (.85)(.8) = .68. 
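
The calculation in Example B.2 can be checked, and the dependence made explicit, with a short Python sketch. It uses exact fractions and the probabilities given in the example; the variable names are illustrative only.

```python
from fractions import Fraction

p_x1 = Fraction(8, 10)                 # P(X = 1)
p_y1_given = {1: Fraction(85, 100),    # P(Y = 1 | X = 1)
              0: Fraction(70, 100)}    # P(Y = 1 | X = 0)

# Rearranging (B.15): P(X = 1, Y = 1) = P(Y = 1 | X = 1) * P(X = 1)
p_both = p_y1_given[1] * p_x1

# Marginal P(Y = 1), summing over both outcomes of X
p_y1 = p_y1_given[1] * p_x1 + p_y1_given[0] * (1 - p_x1)

# Under independence, (B.13) would require the joint to equal the product
independent = (p_both == p_x1 * p_y1)
print(float(p_both), float(p_y1), independent)
```

The joint probability is .68, while the product of the marginals is (.8)(.82) = .656, so (B.13) fails and X and Y are dependent.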


B-3 Features of Probability Distributions 


For many purposes, we will be interested in only a few aspects of the distributions of random vari- 
ables. The features of interest can be put into three categories: measures of central tendency, measures 
of variability or spread, and measures of association between two random variables. We cover the last 
of these in Section B-4. 


B-3a A Measure of Central Tendency: The Expected Value 


The expected value is one of the most important probabilistic concepts that we will encounter in our 
study of econometrics. If X is a random variable, the expected value (or expectation) of X, denoted 
E(X) and sometimes \(\mu_X\) or simply \(\mu\), is a weighted average of all possible values of X. The weights 
are determined by the probability density function. Sometimes, the expected value is called the popu- 
lation mean, especially when we want to emphasize that X represents some variable in a population. 

The precise definition of expected value is simplest in the case that X is a discrete random variable 
taking on a finite number of values, say, \(\{x_1, \ldots, x_k\}\). Let f(x) denote the probability density function 
of X. The expected value of X is the weighted average 

\(E(X) = x_1 f(x_1) + x_2 f(x_2) + \cdots + x_k f(x_k) \equiv \sum_{j=1}^{k} x_j f(x_j).\) [B.17]


This is easily computed given the values of the pdf at each possible outcome of X. 


EXAMPLE B.3 Computing an Expected Value 
Suppose that X takes on the values −1, 0, and 2 with probabilities 1/8, 1/2, and 3/8, respectively. Then, 


\(E(X) = (-1)\cdot(1/8) + 0\cdot(1/2) + 2\cdot(3/8) = 5/8.\)


This example illustrates something curious about expected values: the expected value of X can be 
a number that is not even a possible outcome of X. We know that X takes on the values −1, 0, or 2, 
yet its expected value is 5/8. This makes the expected value deficient for summarizing the central 
tendency of certain discrete random variables, but calculations such as those just mentioned can be 
useful, as we will see later. 

If X is a continuous random variable, then E(X) is defined as an integral: 


\(E(X) = \int_{-\infty}^{\infty} x f(x)\,dx,\) [B.18]


which we assume is well defined. This can still be interpreted as a weighted average. For the most 
common continuous distributions, E(X) is a number that is a possible outcome of X. In this text, we 
will not need to compute expected values using integration, although we will draw on some well- 
known results from probability for expected values of special random variables. 
Given a random variable X and a function g(·), we can create a new random variable g(X). For 
example, if X is a random variable, then so is \(X^2\) and log(X) (if X > 0). The expected value of g(X) is, 
again, simply a weighted average: 


\(E[g(X)] = \sum_{j=1}^{k} g(x_j) f_X(x_j)\) [B.19]


or, for a continuous random variable, 


\(E[g(X)] = \int_{-\infty}^{\infty} g(x) f_X(x)\,dx.\) [B.20]


EXAMPLE B.4 Expected Value of \(X^2\)


For the random variable in Example B.3, let \(g(X) = X^2\). Then, 

\(E(X^2) = (-1)^2(1/8) + (0)^2(1/2) + (2)^2(3/8) = 13/8.\)


In Example B.3, we computed E(X) = 5/8, so that \([E(X)]^2 = 25/64\). This shows that \(E(X^2)\) is not the 
same as \([E(X)]^2\). In fact, for a nonlinear function g(X), \(E[g(X)] \neq g[E(X)]\) (except in very special 
cases). 
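
Examples B.3 and B.4 can be reproduced with a few lines of Python. The sketch below implements the weighted sum in (B.19) with exact fractions; the helper name `expect` is an illustrative choice.

```python
from fractions import Fraction

# pmf from Example B.3: value -> probability
pmf = {-1: Fraction(1, 8), 0: Fraction(1, 2), 2: Fraction(3, 8)}

def expect(g, pmf):
    """E[g(X)] as the probability-weighted sum in (B.19)."""
    return sum(g(x) * p for x, p in pmf.items())

mean = expect(lambda x: x, pmf)               # E(X)
second_moment = expect(lambda x: x**2, pmf)   # E(X^2)
print(mean, second_moment, mean**2)           # 5/8 13/8 25/64
```

Passing a different function g gives the expected value of any other transformation of X without changing the pmf.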

If X and Y are random variables, then g(X, Y) is a random variable for any function g, and so 
we can define its expectation. When X and Y are both discrete, taking on values \(\{x_1, x_2, \ldots, x_k\}\) and 
\(\{y_1, y_2, \ldots, y_m\}\), respectively, the expected value is 


\(E[g(X, Y)] = \sum_{h=1}^{k}\sum_{j=1}^{m} g(x_h, y_j) f_{X,Y}(x_h, y_j),\)


where \(f_{X,Y}\) is the joint pdf of (X, Y). The definition is more complicated for continuous random variables 
because it involves integration; we do not need it here. The extension to more than two random 
variables is straightforward. 


B-3b Properties of Expected Values 


In econometrics, we are not so concerned with computing expected values from various distributions; 
the major calculations have been done many times, and we will largely take these on faith. We will 
need to manipulate some expected values using a few simple rules. These are so important that we 
give them labels: 


Property E.1: For any constant c, E(c) = c. 
Property E.2: For any constants a and b, E(aX + b) = aE(X) + b. 


One useful implication of E.2 is that, if \(\mu = E(X)\), and we define a new random variable as 
\(Y = X - \mu\), then E(Y) = 0; in E.2, take a = 1 and \(b = -\mu\). 

As an example of Property E.2, let X be the temperature measured in Celsius at noon on a particular 
day at a given location; suppose the expected temperature is E(X) = 25. If Y is the temperature 
measured in Fahrenheit, then Y = 32 + (9/5)X. From Property E.2, the expected temperature in 
Fahrenheit is E(Y) = 32 + (9/5)·E(X) = 32 + (9/5)·25 = 77. 

Generally, it is easy to compute the expected value of a linear function of many random variables. 
Property E.3: If \(\{a_1, a_2, \ldots, a_n\}\) are constants and \(\{X_1, X_2, \ldots, X_n\}\) are random variables, then 
\(E(a_1X_1 + a_2X_2 + \cdots + a_nX_n) = a_1E(X_1) + a_2E(X_2) + \cdots + a_nE(X_n).\) 


Or, using summation notation, 
\(E\left(\sum_{i=1}^{n} a_iX_i\right) = \sum_{i=1}^{n} a_iE(X_i).\) [B.21]


As a special case of this, we have (with each \(a_i = 1\)) 


\(E\left(\sum_{i=1}^{n} X_i\right) = \sum_{i=1}^{n} E(X_i),\) [B.22]


so that the expected value of the sum is the sum of expected values. This property is used often for 
derivations in mathematical statistics. 


EXAMPLE B.5 Finding Expected Revenue 


Let \(X_1\), \(X_2\), and \(X_3\) be the numbers of small, medium, and large pizzas, respectively, 
sold during the day at a pizza parlor. These are random variables with expected values 
\(E(X_1) = 25\), \(E(X_2) = 57\), and \(E(X_3) = 40\). The prices of small, medium, and large pizzas are $5.50, 
$7.60, and $9.15. Therefore, the expected revenue from pizza sales on a given day is 

\(E(5.50X_1 + 7.60X_2 + 9.15X_3) = 5.50E(X_1) + 7.60E(X_2) + 9.15E(X_3)\) 
\(= 5.50(25) + 7.60(57) + 9.15(40) = 936.70,\) 


that is, $936.70. The actual revenue on any particular day will generally differ from this value, but this 
is the expected revenue. 
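
As a quick check of Example B.5, the Python sketch below applies Property E.3 to the prices and expected sales given in the example. Only the expected values are needed; nothing is assumed about whether the three sales counts are independent.

```python
prices = [5.50, 7.60, 9.15]      # small, medium, large
expected_sales = [25, 57, 40]    # E(X1), E(X2), E(X3)

# Property E.3: E(sum of a_i * X_i) = sum of a_i * E(X_i)
expected_revenue = sum(a * m for a, m in zip(prices, expected_sales))
print(round(expected_revenue, 2))   # 936.7
```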


We can also use Property E.3 to show that if X ~ Binomial(n, \(\theta\)), then \(E(X) = n\theta\). That is, the 
expected number of successes in n Bernoulli trials is simply the number of trials times the probability 
of success on any particular trial. This is easily seen by writing X as \(X = Y_1 + Y_2 + \cdots + Y_n\), where 
each \(Y_i\) ~ Bernoulli(\(\theta\)). Then, 


\(E(X) = \sum_{i=1}^{n} E(Y_i) = \sum_{i=1}^{n} \theta = n\theta.\)

We can apply this to the airline reservation example, where the airline makes n = 120 reservations, 
and the probability of showing up is \(\theta = .85\). The expected number of people showing 
up is 120(.85) = 102. Therefore, if there are 100 seats available, the expected number of people 
showing up is too large; this has some bearing on whether it is a good idea for the airline to make 
120 reservations. 

Actually, what the airline should do is define a profit function that accounts for the net revenue 
earned per seat sold and the cost per passenger bumped from the flight. This profit function is random 
because the actual number of people showing up is random. Let r be the net revenue from each pas- 
senger. (You can think of this as the price of the ticket for simplicity.) Let c be the compensation owed 
to any passenger bumped from the flight. Neither r nor c is random; these are assumed to be known to 
the airline. Let Y denote profits for the flight. Then, with 100 seats available, 


\(Y = \begin{cases} rX & \text{if } X \le 100 \\ 100r - c(X - 100) & \text{if } X > 100. \end{cases}\)


The first equation gives profit if no more than 100 people show up for the flight; the second equation 
is profit if more than 100 people show up. (In the latter case, the net revenue from ticket sales is 100r, 
because all 100 seats are sold, and then c(X − 100) is the cost of making more than 100 reservations.) 
Using the fact that X has a Binomial(n,.85) distribution, where n is the number of reservations made, 
expected profits, E(Y), can be found as a function of n (and r and c). Computing E(Y) directly would 
be quite difficult, but it can be found quickly using a computer. Once values for r and c are given, the 
value of n that maximizes expected profits can be found by searching over different values of n. 
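
The search described above can be sketched in Python as follows. The values r = 200 and c = 500 are hypothetical, chosen only for illustration, as is the search range of 100 to 140 reservations; the function names are likewise not from the text.

```python
from math import comb

def binomial_pmf(x, n, theta):
    """P(X = x) for X ~ Binomial(n, theta), as in (B.14)."""
    return comb(n, x) * theta**x * (1 - theta)**(n - x)

def profit(x, r, c, seats=100):
    """Profit Y when x passengers show up, following the two-case definition."""
    if x <= seats:
        return r * x
    return seats * r - c * (x - seats)

def expected_profit(n, r, c, theta=0.85, seats=100):
    """E(Y): sum of profit(x) * P(X = x) over x = 0, ..., n."""
    return sum(profit(x, r, c, seats) * binomial_pmf(x, n, theta)
               for x in range(n + 1))

# Hypothetical inputs: r = 200 per ticket, c = 500 per bumped passenger.
r, c = 200, 500
best_n = max(range(100, 141), key=lambda n: expected_profit(n, r, c))
print(best_n, round(expected_profit(best_n, r, c), 2))
```

Raising c relative to r makes bumping more costly and should push the profit-maximizing n down toward the number of seats.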


B-3c Another Measure of Central Tendency: The Median 


The expected value is only one possibility for defining the central tendency of a random variable. 
Another measure of central tendency is the median. A general definition of median is too compli- 
cated for our purposes. If X is continuous, then the median of X, say, m, is the value such that one-half 
of the area under the pdf is to the left of m, and one-half of the area is to the right of m. 

When X is discrete and takes on a finite, odd number of values, the median is obtained by ordering 
the possible values of X and then selecting the value in the middle. For example, if X can take on the 
values {−4, 0, 2, 8, 10, 13, 17}, then the median value of X is 8. If X takes on an even number of values, 
there are really two median values; sometimes, these are averaged to get a unique median value. 
Thus, if X takes on the values {−5, 3, 9, 17}, then the median values are 3 and 9; if we average these, 
we get a median equal to 6. 
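
The two examples above can be reproduced with Python's standard `statistics` module. These functions operate on a list of values, so the sketch implicitly treats each listed value as equally likely, which is an assumption added here rather than something stated in the text.

```python
import statistics

odd_values = [-4, 0, 2, 8, 10, 13, 17]
even_values = [-5, 3, 9, 17]

print(statistics.median(odd_values))         # 8, the middle value
print(statistics.median_low(even_values),    # 3 and 9, the two middle values
      statistics.median_high(even_values))
print(statistics.median(even_values))        # 6.0, their average
```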

In general, the median, sometimes denoted Med(X), and the expected value, E(X), are different. 
Neither is “better” than the other as a measure of central tendency; they are both valid ways to measure 
the center of the distribution of X. In one special case, the median and expected value (or mean) 
are the same. If X has a symmetric distribution about the value \(\mu\), then \(\mu\) is both the expected value 
and the median. Mathematically, the condition is \(f(\mu + x) = f(\mu - x)\) for all x. This case is illustrated 
in Figure B.3. 


FIGURE B.3 A symmetric probability distribution. 


B-3d Measures of Variability: Variance and Standard Deviation 


Although the central tendency of a random variable is valuable, it does not tell us everything we want to 
know about the distribution of a random variable. Figure B.4 shows the pdfs of two random variables with 
the same mean. Clearly, the distribution of X is more tightly centered about its mean than is the distribu- 
tion of Y. We would like to have a simple way of summarizing differences in the spreads of distributions. 


B-3e Variance 


For a random variable X, let \(\mu = E(X)\). There are various ways to measure how far X is from its 
expected value, but the simplest one to work with algebraically is the squared difference, \((X - \mu)^2\). 
(The squaring eliminates the sign from the distance measure; the resulting positive value corresponds 
to our intuitive notion of distance and treats values above and below \(\mu\) symmetrically.) This distance 
is itself a random variable because it can change with every outcome of X. Just as we needed a number 
to summarize the central tendency of X, we need a number that tells us how far X is from \(\mu\), on 
average. One such number is the variance, which tells us the expected distance from X to its mean: 


\(\mathrm{Var}(X) \equiv E[(X - \mu)^2].\) [B.23]


Variance is sometimes denoted \(\sigma_X^2\), or simply \(\sigma^2\), when the context is clear. From (B.23), it follows 
that the variance is always nonnegative. 
As a computational device, it is useful to observe that 


\(\sigma^2 = E(X^2 - 2X\mu + \mu^2) = E(X^2) - 2\mu^2 + \mu^2 = E(X^2) - \mu^2.\) [B.24]


In using either (B.23) or (B.24), we need not distinguish between discrete and continuous random 
variables: the definition of variance is the same in either case. Most often, we first compute 
E(X), then \(E(X^2)\), and then we use the formula in (B.24). For example, if X ~ Bernoulli(\(\theta\)), 
then \(E(X) = \theta\), and, because \(X^2 = X\), \(E(X^2) = \theta\). It follows from equation (B.24) that 
\(\mathrm{Var}(X) = E(X^2) - \mu^2 = \theta - \theta^2 = \theta(1 - \theta)\). 
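
To see that (B.23) and (B.24) agree, the Python sketch below computes the variance of a discrete random variable both ways. The Bernoulli success probability of 4/5 is an illustrative value, and the function name `variance` is not from the text.

```python
from fractions import Fraction

def variance(pmf):
    """Return Var(X) two ways: the definition (B.23) and the shortcut (B.24)."""
    mu = sum(x * p for x, p in pmf.items())
    by_definition = sum((x - mu) ** 2 * p for x, p in pmf.items())
    by_shortcut = sum(x ** 2 * p for x, p in pmf.items()) - mu ** 2
    return by_definition, by_shortcut

theta = Fraction(4, 5)                    # illustrative success probability
bernoulli = {1: theta, 0: 1 - theta}
print(variance(bernoulli), theta * (1 - theta))
```

Both entries of the returned pair equal 4/25, which matches the Bernoulli variance formula at this value of the success probability.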


FIGURE B.4 Random variables with the same mean but different distributions. 


Two important properties of the variance follow. 


Property VAR.1: Var(X) = 0 if, and only if, there is a constant c such that P(X = c) = 1, in 
which case E(X) = c. 


This first property says that the variance of any constant is zero and if a random variable has zero 
variance, then it is essentially constant. 


Property VAR.2: For any constants a and b, \(\mathrm{Var}(aX + b) = a^2\mathrm{Var}(X)\). 


This means that adding a constant to a random variable does not change the variance, but multiplying 
a random variable by a constant changes the variance by a factor equal to the square of that constant. 
For example, if X denotes temperature in Celsius and Y = 32 + (9/5)X is temperature in Fahrenheit, 
then \(\mathrm{Var}(Y) = (9/5)^2\mathrm{Var}(X) = (81/25)\mathrm{Var}(X)\). 


B-3f Standard Deviation 


The standard deviation of a random variable, denoted sd(X), is simply the positive square root of 
the variance: \(\mathrm{sd}(X) \equiv +\sqrt{\mathrm{Var}(X)}\). The standard deviation is sometimes denoted \(\sigma_X\), or simply \(\sigma\), 
when the random variable is understood. Two standard deviation properties immediately follow from 
Properties VAR.1 and VAR.2. 


Property SD.1: For any constant c, sd(c) = 0. 
Property SD.2: For any constants a and b, 
sd(aX + b) = |a| sd(X). 


In particular, if a > 0, then sd(aX) = a·sd(X). 

This last property makes the standard deviation more natural to work with than the variance. For 
example, suppose that X is a random variable measured in thousands of dollars, say, income. If we 
define Y = 1,000X, then Y is income measured in dollars. Suppose that E(X) = 20, and sd(X) = 6. 
Then, E(Y) = 1,000E(X) = 20,000, and sd(Y) = 1,000·sd(X) = 6,000, so that the expected value 
and standard deviation both increase by the same factor, 1,000. If we worked with variance, we would 
have \(\mathrm{Var}(Y) = (1{,}000)^2\mathrm{Var}(X)\), so that the variance of Y is one million times larger than the variance 
of X. 
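
The unit change in this example can be illustrated with a small Python sketch. The two-point income distribution below is hypothetical, chosen only because it has E(X) = 20 and sd(X) = 6; the function name `mean_var` is also an illustrative choice.

```python
from math import sqrt

def mean_var(pmf):
    """Mean and variance of a discrete pmf given as {value: probability}."""
    mu = sum(x * p for x, p in pmf.items())
    var = sum((x - mu) ** 2 * p for x, p in pmf.items())
    return mu, var

# Hypothetical income distribution in thousands of dollars
income_thousands = {14: 0.5, 26: 0.5}
# Y = 1,000 X: the same distribution measured in dollars
income_dollars = {1000 * x: p for x, p in income_thousands.items()}

mu_x, var_x = mean_var(income_thousands)
mu_y, var_y = mean_var(income_dollars)
# Ratios of mean, standard deviation, and variance after the unit change
print(mu_y / mu_x, sqrt(var_y) / sqrt(var_x), var_y / var_x)
```

The mean and standard deviation scale by 1,000, while the variance scales by one million, as Properties E.2, SD.2, and VAR.2 imply.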


B-3g Standardizing a Random Variable 


As an application of the properties of variance and standard deviation (and a topic of practical interest 
in its own right), suppose that given a random variable X, we define a new random variable by 
subtracting off its mean \(\mu\) and dividing by its standard deviation \(\sigma\): 


\(Z \equiv \frac{X - \mu}{\sigma},\) [B.25]

which we can write as Z = aX + b, where \(a \equiv 1/\sigma\) and \(b \equiv -(\mu/\sigma)\). Then, from Property E.2, 

\(E(Z) = aE(X) + b = (\mu/\sigma) - (\mu/\sigma) = 0.\) 


From Property VAR.2, 

\(\mathrm{Var}(Z) = a^2\mathrm{Var}(X) = (\sigma^2/\sigma^2) = 1.\) 
Thus, the random variable Z has a mean of zero and a variance (and therefore a standard deviation) 
equal to one. This procedure is sometimes known as standardizing the random variable X, and Z is 
called a standardized random variable. (In introductory statistics courses, it is sometimes called 
the z-transform of X.) It is important to remember that the standard deviation, not the variance, ap- 
pears in the denominator of (B.25). As we will see, this transformation is frequently used in statistical 
inference. 

As a specific example, suppose that E(X) = 2, and Var(X) = 9. Then, Z = (X — 2)/3 has ex- 
pected value zero and variance one. 
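
A short Python sketch can confirm this example. The text specifies only E(X) = 2 and Var(X) = 9, so the two-point distribution below is a hypothetical choice that happens to have those moments.

```python
from fractions import Fraction

# Hypothetical two-point distribution with E(X) = 2 and Var(X) = 9
pmf_x = {-1: Fraction(1, 2), 5: Fraction(1, 2)}
mu, sigma = 2, 3   # sigma is the standard deviation, the square root of 9

# Z = (X - mu) / sigma keeps the probabilities and rescales the values
pmf_z = {Fraction(x - mu, sigma): p for x, p in pmf_x.items()}

mean_z = sum(z * p for z, p in pmf_z.items())
var_z = sum((z - mean_z) ** 2 * p for z, p in pmf_z.items())
print(mean_z, var_z)   # 0 1
```

Dividing by the variance 9 instead of the standard deviation 3 would give a variance of 1/9 rather than one, which is why the standard deviation belongs in the denominator.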


B-3h Skewness and Kurtosis 


We can use the standardized version of a random variable to define other features of the distribution 
of a random variable. These features are described by using what are called higher order moments. 
For example, the third moment of the random variable Z in (B.25) is used to determine whether a dis- 
tribution is symmetric about its mean. We can write 


\(E(Z^3) = E[(X - \mu)^3]/\sigma^3.\)


If X has a symmetric distribution about \(\mu\), then Z has a symmetric distribution about zero. (The division 
by \(\sigma^3\) does not change whether the distribution is symmetric.) That means the density of Z at any 
two points z and −z is the same, which means that, in computing \(E(Z^3)\), positive values \(z^3\) when z > 0 
are exactly offset with the negative value \((-z)^3 = -z^3\). It follows that, if X is symmetric about \(\mu\), 
then \(E(Z^3) = 0\). Generally, \(E[(X - \mu)^3]/\sigma^3\) is viewed as a measure of skewness in the distribution 
of X. In a statistical setting, we might use data to estimate \(E(Z^3)\) to determine whether an underlying 
population distribution appears to be symmetric. (Computer Exercise C4 in Chapter 5 provides an 
illustration.) 
It also can be informative to compute the fourth moment of Z, 


\(E(Z^4) = E[(X - \mu)^4]/\sigma^4.\)


Because \(Z^4 \ge 0\), \(E(Z^4) \ge 0\) (and, in any interesting case, strictly greater than zero). Without having 
a reference value, it is difficult to interpret values of \(E(Z^4)\), but larger values mean that the tails in the 
distribution of X are thicker. The fourth moment \(E(Z^4)\) is called a measure of kurtosis in the distribution 
of X. In Section B-5, we will obtain \(E(Z^4)\) for the normal distribution. 
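
The third and fourth standardized moments are easy to compute for a discrete distribution. In the Python sketch below, the symmetric three-point distribution is a hypothetical example, the skewed one is the pmf from Example B.3, and the function name `standardized_moment` is illustrative.

```python
from math import sqrt

def standardized_moment(pmf, k):
    """E(Z^k) with Z = (X - mu) / sigma, for a discrete pmf {value: prob}."""
    mu = sum(x * p for x, p in pmf.items())
    sigma = sqrt(sum((x - mu) ** 2 * p for x, p in pmf.items()))
    return sum(((x - mu) / sigma) ** k * p for x, p in pmf.items())

symmetric = {-1: 0.25, 0: 0.5, 1: 0.25}      # hypothetical, symmetric about 0
skewed = {-1: 1 / 8, 0: 1 / 2, 2: 3 / 8}     # pmf from Example B.3

for name, pmf in [("symmetric", symmetric), ("skewed", skewed)]:
    print(name, standardized_moment(pmf, 3), standardized_moment(pmf, 4))
```

The symmetric distribution has a third moment of zero, while the distribution from Example B.3 has a positive third moment.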


B-4 Features of Joint and Conditional Distributions 


B-4a Measures of Association: Covariance and Correlation 


While the joint pdf of two random variables completely describes the relationship between them, it is 
useful to have summary measures of how, on average, two random variables vary with one another. 
As with the expected value and variance, this is similar to using a single number to summarize some- 
thing about an entire distribution, which in this case is a joint distribution of two random variables. 


B-4b Covariance 


Let \(\mu_X = E(X)\) and \(\mu_Y = E(Y)\) and consider the random variable \((X - \mu_X)(Y - \mu_Y)\). Now, if X is 
above its mean and Y is above its mean, then \((X - \mu_X)(Y - \mu_Y) > 0\). This is also true if \(X < \mu_X\) and 
\(Y < \mu_Y\). On the other hand, if \(X > \mu_X\) and \(Y < \mu_Y\), or vice versa, then \((X - \mu_X)(Y - \mu_Y) < 0\). How, 
then, can this product tell us anything about the relationship between X and Y? 
The covariance between two random variables X and Y, sometimes called the population covariance 
to emphasize that it concerns the relationship between two variables describing a population, is 
defined as the expected value of the product \((X - \mu_X)(Y - \mu_Y)\): 


\(\mathrm{Cov}(X, Y) \equiv E[(X - \mu_X)(Y - \mu_Y)],\) [B.26]


which is sometimes denoted \(\sigma_{XY}\). If \(\sigma_{XY} > 0\), then, on average, when X is above its mean, Y is also 
above its mean. If \(\sigma_{XY} < 0\), then, on average, when X is above its mean, Y is below its mean. 
Several expressions useful for computing Cov(X, Y) are as follows: 


\(\mathrm{Cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)] = E[(X - \mu_X)Y]\) 
\(= E[X(Y - \mu_Y)] = E(XY) - \mu_X\mu_Y.\) [B.27]


It follows from (B.27) that if E(X) = 0 or E(Y) = 0, then Cov(X, Y) = E(XY). 

Covariance measures the amount of linear dependence between two random variables. A positive 
covariance indicates that two random variables move in the same direction, while a negative covari- 
ance indicates they move in opposite directions. Interpreting the magnitude of a covariance can be a 
little tricky, as we will see shortly. 

Because covariance is a measure of how two random variables are related, it is natural to ask how 
covariance is related to the notion of independence. This is given by the following property. 


Property COV.1: If X and Y are independent, then Cov(X, Y) = 0. 


This property follows from equation (B.27) and the fact that E(XY) = E(X)E(Y) when X and Y are 
independent. It is important to remember that the converse of COV.1 is not true: zero covariance 
between X and Y does not imply that X and Y are independent. In fact, there are random variables X 
such that, if \(Y = X^2\), Cov(X, Y) = 0. [Any random variable with E(X) = 0 and \(E(X^3) = 0\) has this 
property.] If \(Y = X^2\), then X and Y are clearly not independent: once we know X, we know Y. It seems 
rather strange that X and \(X^2\) could have zero covariance, and this reveals a weakness of covariance as a 
general measure of association between random variables. The covariance is useful in contexts when 
relationships are at least approximately linear. 
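
A concrete case makes the failure of the converse easy to verify. The Python sketch below uses a hypothetical X that takes the values −1, 0, and 1 with equal probability, so that its first and third moments are both zero, and sets Y equal to the square of X.

```python
from fractions import Fraction

third = Fraction(1, 3)
pmf_x = {-1: third, 0: third, 1: third}   # E(X) = 0 and E(X^3) = 0

def expect(g):
    """E[g(X)] under pmf_x."""
    return sum(g(x) * p for x, p in pmf_x.items())

# Cov(X, Y) with Y = X^2, using Cov(X, Y) = E(XY) - E(X)E(Y) from (B.27)
cov = expect(lambda x: x * x**2) - expect(lambda x: x) * expect(lambda x: x**2)

# Independence would require P(X = 1, Y = 1) = P(X = 1) * P(Y = 1)
p_joint = pmf_x[1]                              # X = 1 forces Y = 1
p_product = pmf_x[1] * (pmf_x[-1] + pmf_x[1])   # P(Y = 1) = P(X = -1) + P(X = 1)
print(cov, p_joint, p_product)                  # 0 1/3 2/9
```

The covariance is exactly zero even though the joint probability 1/3 differs from the product 2/9, so X and Y are uncorrelated but not independent.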
The second major property of covariance involves covariances between linear functions. 


Property COV.2: For any constants \(a_1\), \(b_1\), \(a_2\), and \(b_2\), 
\(\mathrm{Cov}(a_1X + b_1, a_2Y + b_2) = a_1a_2\mathrm{Cov}(X, Y).\) [B.28]


An important implication of COV.2 is that the covariance between two random variables can be al- 
tered simply by multiplying one or both of the random variables by a constant. This is important in 
economics because monetary variables, inflation rates, and so on can be defined with different units 
of measurement without changing their meaning. 

Finally, it is useful to know that the absolute value of the covariance between any two random 
variables is bounded by the product of their standard deviations; this is known as the Cauchy-Schwarz 
inequality. 


Property COV.3: \(|\mathrm{Cov}(X, Y)| \le \mathrm{sd}(X)\,\mathrm{sd}(Y)\). 


B-4c Correlation Coefficient 


Suppose we want to know the relationship between amount of education and annual earnings in 
the working population. We could let X denote education and Y denote earnings and then compute 
their covariance. But the answer we get will depend on how we choose to measure education and 
earnings. Property COV.2 implies that the covariance between education and earnings depends on 
whether earnings are measured in dollars or thousands of dollars, or whether education is measured 
in months or years. It is pretty clear that how we measure these variables has no bearing 
on how strongly they are related. But the covariance between them does depend on the units of 
measurement. 

The fact that the covariance depends on units of measurement is a deficiency that is overcome by 
the correlation coefficient between X and Y: 


\(\mathrm{Corr}(X, Y) \equiv \frac{\mathrm{Cov}(X, Y)}{\mathrm{sd}(X)\cdot\mathrm{sd}(Y)} = \frac{\sigma_{XY}}{\sigma_X\sigma_Y};\) [B.29]


the correlation coefficient between X and Y is sometimes denoted \(\rho_{XY}\) (and is sometimes called the 
population correlation). 

Because \(\sigma_X\) and \(\sigma_Y\) are positive, Cov(X, Y) and Corr(X, Y) always have the same sign, and 
Corr(X, Y) = 0 if, and only if, Cov(X, Y) = 0. Some of the properties of covariance carry over to 
correlation. If X and Y are independent, then Corr(X, Y) = 0, but zero correlation does not imply 
independence. (Like the covariance, the correlation coefficient is also a measure of linear depen- 
dence.) However, the magnitude of the correlation coefficient is easier to interpret than the size of the 
covariance due to the following property. 


Property CORR.1: \(-1 \le \mathrm{Corr}(X, Y) \le 1\). 


If Corr(X, Y) = 0, or equivalently Cov(X, Y) = 0, then there is no linear relationship between 
X and Y, and X and Y are said to be uncorrelated random variables; otherwise, X and Y are correlated. 
Corr(X, Y) = 1 implies a perfect positive linear relationship, which means that we can write 
Y = a + bX for some constant a and some constant b > 0. Corr(X, Y) = −1 implies a perfect negative 
linear relationship, so that Y = a + bX for some b < 0. The extreme cases of positive or negative 
1 rarely occur. Values of \(\rho_{XY}\) closer to 1 or −1 indicate stronger linear relationships. 

As mentioned earlier, the correlation between X and Y is invariant to the units of measurement of 
either X or Y. This is stated more generally as follows. 


Property CORR.2: For constants \(a_1\), \(b_1\), \(a_2\), and \(b_2\), with \(a_1a_2 > 0\), 
\(\mathrm{Corr}(a_1X + b_1, a_2Y + b_2) = \mathrm{Corr}(X, Y).\) 

If \(a_1a_2 < 0\), then 
\(\mathrm{Corr}(a_1X + b_1, a_2Y + b_2) = -\mathrm{Corr}(X, Y).\) 


As an example, suppose that the correlation between earnings and education in the working popula- 
tion is .15. This measure does not depend on whether earnings are measured in dollars, thousands of 
dollars, or any other unit; it also does not depend on whether education is measured in years, quarters, 
months, and so on. 
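
Property CORR.2 can be checked numerically. The Python sketch below computes the correlation from a small joint pmf; the joint distribution and the rescaling constants are hypothetical, chosen only to illustrate a change of units, and the function name `corr` is not from the text.

```python
from math import sqrt

# Hypothetical joint pmf of (X, Y): {(x, y): probability}
joint = {(0, 1): 0.2, (1, 1): 0.1, (1, 3): 0.3, (2, 3): 0.4}

def corr(joint, f=lambda x: x, g=lambda y: y):
    """Corr(f(X), g(Y)) from a joint pmf, following (B.26) and (B.29)."""
    ex = sum(f(x) * p for (x, y), p in joint.items())
    ey = sum(g(y) * p for (x, y), p in joint.items())
    cov = sum((f(x) - ex) * (g(y) - ey) * p for (x, y), p in joint.items())
    var_x = sum((f(x) - ex) ** 2 * p for (x, y), p in joint.items())
    var_y = sum((g(y) - ey) ** 2 * p for (x, y), p in joint.items())
    return cov / sqrt(var_x * var_y)

base = corr(joint)
# Change of units in both variables with a1 * a2 > 0
rescaled = corr(joint, f=lambda x: 12 * x, g=lambda y: 1000 * y + 5)
# Negative multiplier on X only, so a1 * a2 < 0
flipped = corr(joint, f=lambda x: -2 * x + 1)
print(base, rescaled, flipped)
```

Up to rounding error, the rescaled correlation equals the original and the flipped one is its negative.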


B-4d Variance of Sums of Random Variables 


Now that we have defined covariance and correlation, we can complete our list of major properties of 
the variance. 


Property VAR.3: For constants a and b, 


\(\mathrm{Var}(aX + bY) = a^2\mathrm{Var}(X) + b^2\mathrm{Var}(Y) + 2ab\,\mathrm{Cov}(X, Y).\) 


It follows immediately that, if X and Y are uncorrelated, so that Cov(X, Y) = 0, then 

Var(X + Y) = Var(X) + Var(Y) [B.30] 
and 

Var(X — Y) = Var(X) + Var(Y). [B.31] 


In the latter case, note how the variance of the difference is the sum of the variances, not the differ- 
ence in the variances. 

As an example of (B.30), let X denote profits earned by a restaurant during a Friday night and let Y 
be profits earned on the following Saturday night. Then, Z = X + Y is profits for the two nights. Suppose 
X and Y each have an expected value of $300 and a standard deviation of $15 (so that the variance 
is 225). Expected profits for the two nights are E(Z) = E(X) + E(Y) = 2·(300) = 600 dollars. 
If X and Y are independent, and therefore uncorrelated, then the variance of total profits is the sum of 
the variances: Var(Z) = Var(X) + Var(Y) = 2·(225) = 450. It follows that the standard deviation 
of total profits is \(\sqrt{450}\), or about $21.21. 
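
The restaurant calculation is a direct application of Property VAR.3 with zero covariance. The Python sketch below wraps the formula in a small helper; the name `var_linear_combo` is an illustrative choice.

```python
from math import sqrt

def var_linear_combo(a, b, var_x, var_y, cov_xy=0.0):
    """Property VAR.3: Var(aX + bY) = a^2 Var(X) + b^2 Var(Y) + 2ab Cov(X, Y)."""
    return a**2 * var_x + b**2 * var_y + 2 * a * b * cov_xy

var_night = 15**2                                          # sd of $15 each night
var_total = var_linear_combo(1, 1, var_night, var_night)   # X + Y, uncorrelated
var_diff = var_linear_combo(1, -1, var_night, var_night)   # X - Y, uncorrelated
print(var_total, var_diff, round(sqrt(var_total), 2))      # 450.0 450.0 21.21
```

Supplying a nonzero `cov_xy` shows how correlation between the two nights would change the variance of the sum and of the difference in opposite directions.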

Expressions (B.30) and (B.31) extend to more than two random variables. To state this extension, 
we need a definition. The random variables \(\{X_1, \ldots, X_n\}\) are pairwise uncorrelated random variables 
if each variable in the set is uncorrelated with every other variable in the set. That is, \(\mathrm{Cov}(X_i, X_j) = 0\) 
for all \(i \neq j\). 


Property VAR.4: If \(\{X_1, \ldots, X_n\}\) are pairwise uncorrelated random variables and 
\(\{a_i: i = 1, \ldots, n\}\) are constants, then 


\(\mathrm{Var}(a_1X_1 + \cdots + a_nX_n) = a_1^2\mathrm{Var}(X_1) + \cdots + a_n^2\mathrm{Var}(X_n).\) 


In summation notation, we can write 


\(\mathrm{Var}\left(\sum_{i=1}^{n} a_iX_i\right) = \sum_{i=1}^{n} a_i^2\mathrm{Var}(X_i).\) [B.32]

A special case of Property VAR.4 occurs when we take \(a_i = 1\) for all i. Then, for pairwise uncorrelated 
random variables, the variance of the sum is the sum of the variances: 


\(\mathrm{Var}\left(\sum_{i=1}^{n} X_i\right) = \sum_{i=1}^{n} \mathrm{Var}(X_i).\) [B.33]


Because independent random variables are uncorrelated (see Property COV.1), the variance of a sum 
of independent random variables is the sum of the variances. 

If the \(X_i\) are not pairwise uncorrelated, then the expression for \(\mathrm{Var}\left(\sum_{i=1}^{n} a_iX_i\right)\) is much more complicated; 
we must add to the right-hand side of (B.32) the terms \(2a_ia_j\mathrm{Cov}(X_i, X_j)\) for all i > j. 

We can use (B.33) to derive the variance for a binomial random variable. Let X ~ Binomial(n, \(\theta\)) 
and write \(X = Y_1 + \cdots + Y_n\), where the \(Y_i\) are independent Bernoulli(\(\theta\)) random variables. Then, by 
(B.33), \(\mathrm{Var}(X) = \mathrm{Var}(Y_1) + \cdots + \mathrm{Var}(Y_n) = n\theta(1 - \theta)\). 

In the airline reservation example with n = 120 and \(\theta = .85\), the variance of the number of passengers
arriving for their reservations is 120(.85)(.15) = 15.3, so the standard deviation is about 3.9.
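As a rough check on the binomial variance formula, the sketch below computes n·θ·(1 − θ) for the airline example and compares it with the sample variance of simulated sums of Bernoulli draws. The function name, the seed, and the number of replications are our own choices for illustration.

```python
import math
import random

def binomial_variance(n, theta):
    """Variance of a Binomial(n, theta) random variable: n * theta * (1 - theta)."""
    return n * theta * (1 - theta)

n, theta = 120, 0.85            # values from the airline reservation example
var_x = binomial_variance(n, theta)
sd_x = math.sqrt(var_x)

# Simulation check (seed and replication count are arbitrary choices):
# each draw is a sum of n independent Bernoulli(theta) outcomes.
rng = random.Random(0)
reps = 20000
draws = [sum(rng.random() < theta for _ in range(n)) for _ in range(reps)]
mean_sim = sum(draws) / reps
var_sim = sum((d - mean_sim) ** 2 for d in draws) / (reps - 1)

print(round(var_x, 1), round(sd_x, 1), round(var_sim, 1))
```

The simulated variance should be close to, but not exactly equal to, the value from the formula, because it is computed from a finite number of draws.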


B-4e Conditional Expectation 


Covariance and correlation measure the linear relationship between two random variables and treat 
them symmetrically. More often in the social sciences, we would like to explain one variable, called Y, 
in terms of another variable, say, X. Further, if Y is related to X in a nonlinear fashion, we would like 
to know this. Call Y the explained variable and X the explanatory variable. For example, Y might be
hourly wage, and X might be years of formal education. 

We have already introduced the notion of the conditional probability density function of Y given X. 
Thus, we might want to see how the distribution of wages changes with education level. However, we 
usually want to have a simple way of summarizing this distribution. A single number will no longer 
suffice, because the distribution of Y given X = x generally depends on the value of x. Nevertheless, 
we can summarize the relationship between Y and X by looking at the conditional expectation of Y 
given X, sometimes called the conditional mean. The idea is this. Suppose we know that X has taken 
on a particular value, say, x. Then, we can compute the expected value of Y, given that we know this 
outcome of X. We denote this expected value by E(Y|X = x), or sometimes E(Y|x) for shorthand. 
Generally, as x changes, so does E(Y|x). 

When Y is a discrete random variable taking on values \(\{y_1, \ldots, y_m\}\), then


\[ E(Y|x) = \sum_{j=1}^{m} y_j f_{Y|X}(y_j|x). \]


When Y is continuous, E(Y|x) is defined by integrating \(y f_{Y|X}(y|x)\) over all possible values of y. As with
unconditional expectations, the conditional expectation is a weighted average of possible values of 
Y, but now the weights reflect the fact that X has taken on a specific value. Thus, E(Y|x) is just some 
function of x, which tells us how the expected value of Y varies with x. 

As an example, let (X, Y) represent the population of all working individuals, where X is years 
of education and Y is hourly wage. Then, E(Y|X = 12) is the average hourly wage for all people in 
the population with 12 years of education (roughly a high school education). E(Y|X = 16) is the 
average hourly wage for all people with 16 years of education. Tracing out the expected value for 
various levels of education provides important information on how wages and education are related. 
See Figure B.5 for an illustration. 


FIGURE B.5 The expected value of hourly wage given various levels of education. [Vertical axis: E(WAGE|EDUC); horizontal axis: EDUC, with ticks at 4, 8, 12, 16, and 20.]

In principle, the expected value of hourly wage can be found at each level of education, and
these expectations can be summarized in a table. Because education can vary widely—and can even 
be measured in fractions of a year—this is a cumbersome way to show the relationship between 
average wage and amount of education. In econometrics, we typically specify simple functions that 
capture this relationship. As an example, suppose that the expected value of WAGE given EDUC is 
the linear function 


E(WAGE|EDUC) = 1.05 + .45 EDUC. 


If this relationship holds in the population of working people, the average wage for people with eight 
years of education is 1.05 + .45(8) = 4.65, or $4.65. The average wage for people with 16 years of 
education is 8.25, or $8.25. The coefficient on EDUC implies that each year of education increases 
the expected hourly wage by .45, or 45¢. 

Conditional expectations can also be nonlinear functions. For example, suppose that 
E(Y|x) = 10/x, where X is a random variable that is always greater than zero. This function is graphed 
in Figure B.6. This could represent a demand function, where Y is quantity demanded and X is price. 
If Y and X are related in this way, an analysis of linear association, such as correlation analysis, would 
be incomplete. 


B-4f Properties of Conditional Expectation 


Several basic properties of conditional expectations are useful for derivations in econometric analysis. 
Property CE.1: E[c(X)|X] = c(X), for any function c(X). 


This first property means that functions of X behave as constants when we compute expectations conditional
on X. For example, \(E(X^2|X) = X^2\). Intuitively, this simply means that if we know X, then we
also know \(X^2\).


Property CE.2: For functions a(X) and b(X), 


\[ E[a(X)Y + b(X)|X] = a(X)E(Y|X) + b(X). \]


FIGURE B.6 Graph of E(Y|x) = 10/x. [Vertical axis: E(Y|x); curve label: E(Y|x) = 10/x.]

For example, we can easily compute the conditional expectation of a function such as
\(XY + 2X^2\): \(E(XY + 2X^2|X) = XE(Y|X) + 2X^2\).
The next property ties together the notions of independence and conditional expectations. 


Property CE.3: If X and Y are independent, then E(Y|X) = E(Y). 


This property means that, if X and Y are independent, then the expected value of Y given X does not 
depend on X, in which case, E(Y|X) always equals the (unconditional) expected value of Y. In the 
wage and education example, if wages were independent of education, then the average wages of high 
school and college graduates would be the same. Because this is almost certainly false, we cannot as- 
sume that wage and education are independent. 

A special case of Property CE.3 is the following: if U and X are independent and E(U) = 0, then 
E(U|X) = 0. 

There are also properties of the conditional expectation that have to do with the fact that E(Y|X)
is a function of X, say, \(E(Y|X) = \mu(X)\). Because X is a random variable, \(\mu(X)\) is also a random variable.
Furthermore, \(\mu(X)\) has a probability distribution and therefore an expected value. Generally, the
expected value of \(\mu(X)\) could be very difficult to compute directly. The law of iterated expectations
says that the expected value of \(\mu(X)\) is simply equal to the expected value of Y. We write this as
follows.


Property CE.4: E[E(Y|X)] = E(Y).


This property is a little hard to grasp at first. It means that, if we first obtain E(Y|X) as a function 
of X and take the expected value of this (with respect to the distribution of X, of course), then we end 
up with E(Y). This is hardly obvious, but it can be derived using the definition of expected values. 

As an example of how to use Property CE.4, let Y = WAGE and X = EDUC, where WAGE is measured
in dollars per hour and EDUC is measured in years. Suppose the expected value of WAGE given EDUC is
E(WAGE|EDUC) = 4 + .60 EDUC. Further, E(EDUC) = 11.5. Then, the law of iterated expectations
implies that E(WAGE) = E(4 + .60 EDUC) = 4 + .60 E(EDUC) = 4 + .60(11.5) = 10.90,
or $10.90 an hour.
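To see the law of iterated expectations at work numerically, the following sketch averages the conditional mean over a distribution of EDUC. The three-point distribution of EDUC is purely hypothetical (the text specifies only that E(EDUC) = 11.5), and the function and variable names are ours.

```python
# Hypothetical distribution of EDUC (an assumption for illustration only),
# chosen so that E(EDUC) = 11.5 as in the text.
educ_pmf = {8: 0.25, 12: 0.625, 16: 0.125}

def cond_mean_wage(educ):
    """E(WAGE | EDUC = educ) from the example: 4 + .60 * educ."""
    return 4 + 0.60 * educ

mean_educ = sum(e * p for e, p in educ_pmf.items())

# Law of iterated expectations: E(WAGE) = E[E(WAGE | EDUC)],
# where the outer expectation is over the distribution of EDUC.
mean_wage = sum(cond_mean_wage(e) * p for e, p in educ_pmf.items())

# Because the conditional mean is linear in EDUC, this equals 4 + .60 * E(EDUC).
shortcut = cond_mean_wage(mean_educ)

print(mean_educ, round(mean_wage, 2), round(shortcut, 2))
```

Any other distribution of EDUC with mean 11.5 would give the same E(WAGE), because the conditional mean is linear in EDUC.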

The next property states a more general version of the law of iterated expectations. 


Property CE.4': E(Y|X) = E[E(Y|X, Z)|X].


In other words, we can find E(Y|X) in two steps. First, find E(Y|X, Z) for any other random vari- 
able Z. Then, find the expected value of E(Y|X, Z), conditional on X. 


Property CE.5: If E(Y|X) = E(Y), then Cov(X, Y) = 0 [and so Corr(X, Y) = 0]. In fact, every 
function of X is uncorrelated with Y. 


This property means that, if knowledge of X does not change the expected value of Y, then X and Y 
must be uncorrelated, which implies that if X and Y are correlated, then E(Y|X) must depend on X. 
The converse of Property CE.5 is not true: if X and Y are uncorrelated, E(Y|X) could still depend
on X. For example, suppose \(Y = X^2\). Then, \(E(Y|X) = X^2\), which is clearly a function of X. However,
as we mentioned in our discussion of covariance and correlation, it is possible that X and \(X^2\) are uncorrelated.
The conditional expectation captures the nonlinear relationship between X and Y that correlation
analysis would miss entirely.
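A small numerical case may make this concrete. In the sketch below, X takes the values −1, 0, and 1 with equal probability; this particular distribution is our assumption, chosen because it is symmetric about zero. The covariance between X and X² is then exactly zero even though E(Y|X) = X² varies with X.

```python
from fractions import Fraction

# Hypothetical distribution: X is -1, 0, or 1, each with probability 1/3.
support = [-1, 0, 1]
prob = Fraction(1, 3)

def expect(g):
    """E[g(X)] under the discrete distribution above."""
    return sum(prob * g(x) for x in support)

mean_x = expect(lambda x: x)
mean_y = expect(lambda x: x ** 2)            # Y = X^2

# Cov(X, Y) = E(XY) - E(X)E(Y), with XY = X^3.
cov_xy = expect(lambda x: x * x ** 2) - mean_x * mean_y

# E(Y | X = x) = x^2, which clearly depends on x.
cond_mean = {x: x ** 2 for x in support}

print(cov_xy, cond_mean)
```

Exact fractions are used so that the zero covariance is not blurred by floating-point rounding.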

Properties CE.4 and CE.5 have two important implications: if U and X are random variables such 
that E(U|X) = 0, then E(U) = 0, and U and X are uncorrelated. 


Property CE.6: If \(E(Y^2) < \infty\) and \(E[g(X)^2] < \infty\) for some function g, then
\(E\{[Y - \mu(X)]^2|X\} \le E\{[Y - g(X)]^2|X\}\) and \(E\{[Y - \mu(X)]^2\} \le E\{[Y - g(X)]^2\}\).

Property CE.6 is very useful in predicting or forecasting contexts. The first inequality says that, if we
measure prediction inaccuracy as the expected squared prediction error, conditional on X, then the 
conditional mean is better than any other function of X for predicting Y. The conditional mean also 
minimizes the unconditional expected squared prediction error. 
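The unconditional version of Property CE.6 can be illustrated with a small discrete example. The joint distribution below is hypothetical, as are the two competing predictors; the sketch simply compares expected squared prediction errors.

```python
from fractions import Fraction

# Hypothetical joint distribution of (X, Y); each listed pair has probability 1/4.
joint = {(0, 1): Fraction(1, 4), (0, 3): Fraction(1, 4),
         (1, 2): Fraction(1, 4), (1, 6): Fraction(1, 4)}

def cond_mean(x):
    """mu(x) = E(Y | X = x), computed from the joint distribution."""
    p_x = sum(p for (xv, _), p in joint.items() if xv == x)
    return sum(y * p for (xv, y), p in joint.items() if xv == x) / p_x

def mse(g):
    """Unconditional expected squared prediction error E{[Y - g(X)]^2}."""
    return sum(p * (y - g(x)) ** 2 for (x, y), p in joint.items())

mean_y = sum(y * p for (_, y), p in joint.items())

mse_mu = mse(cond_mean)                 # predictor: the conditional mean
mse_const = mse(lambda x: mean_y)       # predictor: the unconditional mean
mse_other = mse(lambda x: 1 + 2 * x)    # predictor: some other function of X

print(mse_mu, mse_const, mse_other)
```

In this example the conditional mean has the smallest expected squared error of the three predictors, as the property requires.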


B-4g Conditional Variance 


Given random variables X and Y, the variance of Y, conditional on X = x, is simply the variance associated
with the conditional distribution of Y, given X = x: \(E\{[Y - E(Y|x)]^2|x\}\). The formula


\[ \mathrm{Var}(Y|X = x) = E(Y^2|x) - [E(Y|x)]^2 \]


is often useful for calculations. Only occasionally will we have to compute a conditional variance. 
But we will have to make assumptions about and manipulate conditional variances for certain topics 
in regression analysis. 

As an example, let Y = SAVING and X = INCOME (both of these measured annually for the 
population of all families). Suppose that Var(SAVING|INCOME) = 400 + .25 INCOME. This says 
that, as income increases, the variance in saving levels also increases. It is important to see that the 
relationship between the variance of SAVING and INCOME is totally separate from that between the 
expected value of SAVING and INCOME. 

We state one useful property about the conditional variance. 


Property CV.1: If X and Y are independent, then Var(Y|X) = Var(Y).


This property is pretty clear, as the distribution of Y given X does not depend on X, and Var(Y|X) is
just one feature of this distribution.


B-5 The Normal and Related Distributions 
B-5a The Normal Distribution 


The normal distribution and those derived from it are the most widely used distributions in statistics 
and econometrics. Assuming that random variables defined over populations are normally distributed 
simplifies probability calculations. In addition, we will rely heavily on the normal and related distri- 
butions to conduct inference in statistics and econometrics—even when the underlying population is 
not necessarily normal. We must postpone the details, but be assured that these distributions will arise 
many times throughout this text. 

A normal random variable is a continuous random variable that can take on any value. Its prob- 
ability density function has the familiar bell shape graphed in Figure B.7. 

Mathematically, the pdf of X can be written as 


\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp[-(x - \mu)^2/(2\sigma^2)], \quad -\infty < x < \infty, \] [B.34]


where \(\mu = E(X)\) and \(\sigma^2 = \mathrm{Var}(X)\). We say that X has a normal distribution with expected value \(\mu\)
and variance \(\sigma^2\), written as \(X \sim \mathrm{Normal}(\mu, \sigma^2)\). Because the normal distribution is symmetric about
\(\mu\), \(\mu\) is also the median of X. The normal distribution is sometimes called the Gaussian distribution
after the famous mathematician C. F. Gauss.

Certain random variables appear to roughly follow a normal distribution. Human heights and 
weights, test scores, and county unemployment rates have pdfs roughly the shape in Figure B.7. Other dis- 
tributions, such as income distributions, do not appear to follow the normal probability function. In most 
countries, income is not symmetrically distributed about any value; the distribution is skewed toward the 
upper tail. In some cases, a variable can be transformed to achieve normality. A popular transformation is
the natural log, which makes sense for positive random variables. If X is a positive random variable, such
as income, and Y = log(X) has a normal distribution, then we say that X has a lognormal distribution.
It turns out that the lognormal distribution fits income distribution pretty well in many countries. Other
variables, such as prices of goods, appear to be well described as lognormally distributed.

FIGURE B.7 The general shape of the normal probability density function. [Curve label: \(f_X\) for a normal random variable.]


B-5b The Standard Normal Distribution 


One special case of the normal distribution occurs when the mean is zero and the variance (and, there- 
fore, the standard deviation) is unity. If a random variable Z has a Normal(0,1) distribution, then we 
say it has a standard normal distribution. The pdf of a standard normal random variable is denoted
\(\phi(z)\); from (B.34), with \(\mu = 0\) and \(\sigma^2 = 1\), it is given by


\[ \phi(z) = \frac{1}{\sqrt{2\pi}} \exp(-z^2/2), \quad -\infty < z < \infty. \] [B.35]

The standard normal cumulative distribution function is denoted \(\Phi(z)\) and is obtained as the
area under \(\phi\), to the left of z; see Figure B.8. Recall that \(\Phi(z) = P(Z \le z)\); because Z is continuous,
\(\Phi(z) = P(Z < z)\) as well.

No simple formula can be used to obtain the values of \(\Phi(z)\) [because \(\Phi(z)\) is the integral of
the function in (B.35), and this integral has no closed form]. Nevertheless, the values for \(\Phi(z)\) are
easily tabulated; they are given for z between -3.1 and 3.1 in Table G.1 in Statistical Tables. For
\(z \le -3.1\), \(\Phi(z)\) is less than .001, and for \(z \ge 3.1\), \(\Phi(z)\) is greater than .999. Most statistics and
econometrics software packages include simple commands for computing values of the standard normal
cdf, so we can often avoid printed tables entirely and obtain the probabilities for any value of z.

Using basic facts from probability—and, in particular, properties (B.7) and (B.8) concerning 
cdfs—we can use the standard normal cdf for computing the probability of any event involving a 
standard normal random variable. The most important formulas are 


\[ P(Z > z) = 1 - \Phi(z), \] [B.36]
\[ P(Z < -z) = P(Z > z), \] [B.37]
and
\[ P(a \le Z \le b) = \Phi(b) - \Phi(a). \] [B.38]


Because Z is a continuous random variable, all three formulas hold whether or not the inequalities
are strict. Some examples include P(Z > .44) = 1 - .67 = .33, P(Z < -.92) =
P(Z > .92) = 1 - .821 = .179, and P(-1 < Z ≤ .5) = .692 - .159 = .533.

FIGURE B.8 The standard normal cumulative distribution function.

Another useful expression is that, for any c > 0, 


\[ P(|Z| > c) = P(Z > c) + P(Z < -c) = 2 \cdot P(Z > c) = 2[1 - \Phi(c)]. \] [B.39]


Thus, the probability that the absolute value of Z is bigger than some positive constant c is simply 
twice the probability P(Z > c); this reflects the symmetry of the standard normal distribution. 
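Formulas (B.36) through (B.39) are easy to evaluate in software. The sketch below uses `NormalDist` from Python's standard `statistics` module as one possible tool; the value c = 1.96 in the last line is our own choice for illustration and does not come from the text.

```python
from statistics import NormalDist

Z = NormalDist()   # standard normal: mean 0, standard deviation 1

# (B.36): P(Z > .44) = 1 - Phi(.44)
p_upper = 1 - Z.cdf(0.44)

# (B.37): P(Z < -.92) = P(Z > .92)
p_lower = Z.cdf(-0.92)

# (B.38): P(-1 < Z <= .5) = Phi(.5) - Phi(-1)
p_between = Z.cdf(0.5) - Z.cdf(-1.0)

# (B.39): P(|Z| > c) = 2[1 - Phi(c)], here with the illustrative value c = 1.96
c = 1.96
p_abs = 2 * (1 - Z.cdf(c))

print(round(p_upper, 2), round(p_lower, 3), round(p_between, 3), round(p_abs, 3))
```

The first three results agree with the table-based values .33, .179, and .533 given above.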

In most applications, we start with a normally distributed random variable, \(X \sim \mathrm{Normal}(\mu, \sigma^2)\),
where \(\mu\) is different from zero and \(\sigma^2 \neq 1\). Any normal random variable can be turned into a standard
normal using the following property.


Property Normal.1: If \(X \sim \mathrm{Normal}(\mu, \sigma^2)\), then \((X - \mu)/\sigma \sim \mathrm{Normal}(0, 1)\).


Property Normal.1 shows how to turn any normal random variable into a standard normal. Thus,
suppose X ~ Normal(3, 4), and we would like to compute P(X ≤ 1). The steps always involve the
normalization of X to a standard normal:

\[ P(X \le 1) = P(X - 3 \le 1 - 3) = P\left( \frac{X - 3}{2} \le -1 \right) = P(Z \le -1) = \Phi(-1) = .159. \]


EXAMPLE B.6 Probabilities for a Normal Random Variable


First, let us compute P(2 < X ≤ 6) when X ~ Normal(4,9) (whether we use < or ≤ is irrelevant
because X is a continuous random variable). Now,

\[ P(2 < X \le 6) = P\left( \frac{2 - 4}{3} < \frac{X - 4}{3} \le \frac{6 - 4}{3} \right) = P(-2/3 < Z \le 2/3) = \Phi(.67) - \Phi(-.67) = .749 - .251 = .498. \]


Now, let us compute P(|X| > 2):


\[ \begin{aligned} P(|X| > 2) &= P(X > 2) + P(X < -2) \\ &= P[(X - 4)/3 > (2 - 4)/3] + P[(X - 4)/3 < (-2 - 4)/3] \\ &= 1 - \Phi(-2/3) + \Phi(-2) \\ &= 1 - .251 + .023 = .772. \end{aligned} \]
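The same probabilities can be computed without rounding 2/3 to .67 for a table lookup. The sketch below uses `statistics.NormalDist` from the Python standard library; the variable names are ours. Because no rounding of the z value is involved, the results (roughly .495 and .770) differ slightly in the third decimal from the table-based answers above.

```python
from statistics import NormalDist

X = NormalDist(mu=4, sigma=3)   # Normal(4, 9): a variance of 9 means sd = 3

# P(2 < X <= 6), computed directly from the cdf of X.
p_between = X.cdf(6) - X.cdf(2)

# P(|X| > 2) = P(X > 2) + P(X < -2).
p_abs = (1 - X.cdf(2)) + X.cdf(-2)

# The same first probability after standardizing, as in the text.
Z = NormalDist()
p_between_std = Z.cdf((6 - 4) / 3) - Z.cdf((2 - 4) / 3)

print(round(p_between, 3), round(p_abs, 3))
```

Working with X directly and working with the standardized variable give the same answer, which is the content of Property Normal.1.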


B-5c Additional Properties of the Normal Distribution 


We end this subsection by collecting several other facts about normal distributions that we will 
later use. 


Property Normal.2: If \(X \sim \mathrm{Normal}(\mu, \sigma^2)\), then \(aX + b \sim \mathrm{Normal}(a\mu + b, a^2\sigma^2)\).


Thus, if X ~ Normal(1,9), then Y = 2X + 3 is distributed as normal with mean 2E(X) + 3 = 5 and
variance \(2^2 \cdot 9 = 36\); sd(Y) = 2sd(X) = 2(3) = 6.

Earlier, we discussed how, in general, zero correlation and independence are not the same. 
In the case of normally distributed random variables, it turns out that zero correlation suffices for 
independence. 


Property Normal.3: If X and Y are jointly normally distributed, then they are independent if, 
and only if, Cov(X, Y) = 0. 


Property Normal.4: Any linear combination of independent, identically distributed normal 
random variables has a normal distribution. 


For example, let \(X_i\), for i = 1, 2, and 3, be independent random variables distributed as \(\mathrm{Normal}(\mu, \sigma^2)\).
Define \(W = X_1 + 2X_2 - 3X_3\). Then, W is normally distributed; we must simply find its mean and
variance. Now,


\[ E(W) = E(X_1) + 2E(X_2) - 3E(X_3) = \mu + 2\mu - 3\mu = 0. \]
Also,
\[ \mathrm{Var}(W) = \mathrm{Var}(X_1) + 4\mathrm{Var}(X_2) + 9\mathrm{Var}(X_3) = 14\sigma^2. \]


Property Normal.4 also implies that the average of independent, normally distributed random
variables has a normal distribution. If \(Y_1, Y_2, \ldots, Y_n\) are independent random variables and each is
distributed as \(\mathrm{Normal}(\mu, \sigma^2)\), then


\[ \bar{Y} \sim \mathrm{Normal}(\mu, \sigma^2/n). \] [B.40]

This result is critical for statistical inference about the mean in a normal population.
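The mean and variance calculations for W and for the sample average follow the same pattern, which the sketch below collects in one helper. The function name and the numerical values of the mean and variance are hypothetical; the helper assumes independent variables with a common mean and variance, as in the examples above.

```python
def linear_combination_moments(coefs, mu, sigma2):
    """Mean and variance of sum(a_i * X_i) for independent X_i with
    common mean mu and common variance sigma2 (Property VAR.4)."""
    mean = mu * sum(coefs)
    var = sigma2 * sum(a ** 2 for a in coefs)
    return mean, var

mu, sigma2 = 2.0, 1.5           # hypothetical values, for illustration only

# W = X1 + 2*X2 - 3*X3: mean (1 + 2 - 3)*mu = 0, variance (1 + 4 + 9)*sigma2.
mean_w, var_w = linear_combination_moments([1, 2, -3], mu, sigma2)

# Sample average of n draws: every coefficient is 1/n, so the variance is sigma2/n.
n = 10
mean_avg, var_avg = linear_combination_moments([1 / n] * n, mu, sigma2)

print(mean_w, var_w, round(mean_avg, 6), round(var_avg, 6))
```

The helper only computes the two moments; that W and the sample average are themselves normally distributed comes from Property Normal.4.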

Other features of the normal distribution are worth knowing, although they do not play a central 
role in the text. Because a normal random variable is symmetric about its mean, it has zero skewness, 
that is, \(E[(X - \mu)^3] = 0\). Further, it can be shown that


\[ E[(X - \mu)^4]/\sigma^4 = 3, \]


or \(E(Z^4) = 3\), where Z has a standard normal distribution. Because the normal distribution is so prevalent
in probability and statistics, the measure of kurtosis for any given random variable X (whose
fourth moment exists) is often defined to be \(E[(X - \mu)^4]/\sigma^4 - 3\), that is, relative to the value for the
standard normal distribution. If \(E[(X - \mu)^4]/\sigma^4 > 3\), then the distribution of X has fatter tails than
the normal distribution (a somewhat common occurrence, such as with the t distribution to be introduced
shortly); if \(E[(X - \mu)^4]/\sigma^4 < 3\), then the distribution has thinner tails than the normal (a rarer
situation).


B-5d The Chi-Square Distribution 


The chi-square distribution is obtained directly from independent, standard normal random variables. 
Let \(Z_i\), \(i = 1, 2, \ldots, n\), be independent random variables, each distributed as standard normal. Define
a new random variable as the sum of the squares of the \(Z_i\):


\[ X = \sum_{i=1}^{n} Z_i^2. \] [B.41]


Then, X has what is known as a chi-square distribution with n degrees of freedom (or df for short).
We write this as \(X \sim \chi_n^2\). The df in a chi-square distribution corresponds to the number of terms in the
sum in (B.41). The concept of degrees of freedom will play an important role in our statistical and 
econometric analyses. 

The pdf for chi-square distributions with varying degrees of freedom is given in Figure B.9; 
we will not need the formula for this pdf, and so we do not reproduce it here. From equation (B.41), 
it is clear that a chi-square random variable is always nonnegative, and that, unlike the normal 
distribution, the chi-square distribution is not symmetric about any point. It can be shown 
that if \(X \sim \chi_n^2\), then the expected value of X is n [the number of terms in (B.41)], and the variance of
X is 2n. 
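Definition (B.41) translates directly into a simulation. The sketch below builds chi-square draws as sums of squared standard normal draws and compares the sample mean and variance with n and 2n. The degrees of freedom, seed, and number of replications are arbitrary choices of ours.

```python
import random

def chi_square_draw(n, rng):
    """One chi-square(n) draw: the sum of n squared standard normals, as in (B.41)."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(n))

rng = random.Random(1)          # fixed seed so the run is reproducible
n, reps = 5, 20000              # illustrative choices
draws = [chi_square_draw(n, rng) for _ in range(reps)]

mean_sim = sum(draws) / reps
var_sim = sum((d - mean_sim) ** 2 for d in draws) / (reps - 1)

print(round(mean_sim, 2), round(var_sim, 2), min(draws) >= 0)
```

With these settings the sample mean should be near 5 and the sample variance near 10, and every draw is nonnegative by construction.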


FIGURE B.9 The chi-square distribution with various degrees of freedom.


B-5e The t Distribution


The t distribution is the workhorse in classical statistics and multiple regression analysis. We obtain a
t distribution from a standard normal and a chi-square random variable.

Let Z have a standard normal distribution and let X have a chi-square distribution with n degrees
of freedom. Further, assume that Z and X are independent. Then, the random variable


\[ T = \frac{Z}{\sqrt{X/n}} \] [B.42]

has a t distribution with n degrees of freedom. We will denote this by \(T \sim t_n\). The t distribution gets
its degrees of freedom from the chi-square random variable in the denominator of (B.42).

The pdf of the t distribution has a shape similar to that of the standard normal distribution,
except that it is more spread out and therefore has more area in the tails. The expected value of a
t distributed random variable is zero (strictly speaking, the expected value exists only for n > 1),
and the variance is n/(n - 2) for n > 2. (The variance does not exist for n ≤ 2 because the distribution
is so spread out.) The pdf of the t distribution is plotted in Figure B.10 for various degrees
of freedom. As the degrees of freedom gets large, the t distribution approaches the standard normal
distribution.
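The construction in (B.42) can also be simulated. The following sketch draws a standard normal Z and an independent chi-square X, forms the ratio, and compares the sample variance with n/(n − 2). The degrees of freedom, seed, and replication count are our own illustrative choices.

```python
import math
import random

def t_draw(n, rng):
    """One t(n) draw built as in (B.42): Z / sqrt(X / n), with Z standard normal
    and X an independent chi-square(n) (a sum of n squared standard normals)."""
    z = rng.gauss(0.0, 1.0)
    x = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(n))
    return z / math.sqrt(x / n)

rng = random.Random(2)          # fixed seed so the run is reproducible
n, reps = 10, 20000             # illustrative choices; n > 2 so the variance exists
draws = [t_draw(n, rng) for _ in range(reps)]

mean_sim = sum(draws) / reps
var_sim = sum((d - mean_sim) ** 2 for d in draws) / (reps - 1)
theory_var = n / (n - 2)

print(round(mean_sim, 3), round(var_sim, 3), theory_var)
```

The sample variance should be somewhat above one, reflecting the extra area in the tails relative to the standard normal distribution.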


B-5f The F Distribution 


Another important distribution for statistics and econometrics is the F distribution. In particular, the F 
distribution will be used for testing hypotheses in the context of multiple regression analysis. 
To define an F random variable, let \(X_1 \sim \chi_{k_1}^2\) and \(X_2 \sim \chi_{k_2}^2\) and assume that \(X_1\) and \(X_2\) are independent.
Then, the random variable


\[ F = \frac{(X_1/k_1)}{(X_2/k_2)} \] [B.43]


has an F distribution with \((k_1, k_2)\) degrees of freedom. We denote this as \(F \sim F_{k_1,k_2}\). The pdf of the F
distribution with different degrees of freedom is given in Figure B.11.

The order of the degrees of freedom in \(F_{k_1,k_2}\) is critical. The integer \(k_1\) is called the numerator
degrees of freedom because it is associated with the chi-square variable in the numerator. Likewise,
the integer \(k_2\) is called the denominator degrees of freedom because it is associated with the
chi-square variable in the denominator. This can be a little tricky because (B.43) can also be written
as \((X_1 k_2)/(X_2 k_1)\), so that \(k_1\) appears in the denominator. Just remember that the numerator df is
the integer associated with the chi-square variable in the numerator of (B.43), and similarly for the
denominator df.

FIGURE B.10 The t distribution with various degrees of freedom.


FIGURE B.11 The \(F_{k_1,k_2}\) distribution for various degrees of freedom, \(k_1\) and \(k_2\).


Summary 
In this Math Refresher, we have reviewed the probability concepts that are needed in econometrics. Most
of the concepts should be familiar from your introductory course in probability and statistics. Some of the 
more advanced topics, such as features of conditional expectations, do not need to be mastered now—there 
is time for that when these concepts arise in the context of regression analysis in Part 1. 

In an introductory statistics course, the focus is on calculating means, variances, covariances, and 
so on for particular distributions. In Part 1, we will not need such calculations: we mostly rely on the 
properties of expectations, variances, and so on that have been stated in this Math Refresher. 


Key Terms 


Bernoulli (or Binary) Random Variable
Binomial Distribution
Chi-Square Distribution
Conditional Distribution
Conditional Expectation
Continuous Random Variable
Correlation Coefficient
Covariance
Cumulative Distribution Function (cdf)
Degrees of Freedom
Discrete Random Variable
Expected Value
Experiment
F Distribution
Independent Random Variables
Joint Distribution
Kurtosis
Law of Iterated Expectations
Median
Normal Distribution
Pairwise Uncorrelated Random Variables
Probability Density Function (pdf)
Random Variable
Skewness
Standard Deviation
Standard Normal Distribution
Standardized Random Variable
Symmetric Distribution
t Distribution
Uncorrelated Random Variables
Variance


Problems 


1 Suppose that a high school student is preparing to take the SAT exam. Explain why his or her eventual 
SAT score is properly viewed as a random variable. 


2 Let X be a random variable distributed as Normal(5,4). Find the probabilities of the following events: 
(i) P(X ≤ 6).
(ii) P(X > 4). 
(iii) P(|X — 5| > 1). 


3 Much is made of the fact that certain mutual funds outperform the market year after year (that is, the 
return from holding shares in the mutual fund is higher than the return from holding a portfolio such as 
the S&P 500). For concreteness, consider a 10-year period and let the population be the 4,170 mutual 
funds reported in The Wall Street Journal on January 1, 1995. By saying that performance relative to 
the market is random, we mean that each fund has a 50-50 chance of outperforming the market in any 
year and that performance is independent from year to year. 

(i) If performance relative to the market is truly random, what is the probability that any particular 
fund outperforms the market in all 10 years? 

(ii) Of the 4,170 mutual funds, what is the expected number of funds that will outperform the market 
in all 10 years? 

(iii) Find the probability that at least one fund out of 4,170 funds outperforms the market in all 
10 years. What do you make of your answer? 

(iv) If you have a statistical package that computes binomial probabilities, find the probability that at 
least five funds outperform the market in all 10 years. 
4 For a randomly selected county in the United States, let X represent the proportion of adults over age 65
who are employed, or the elderly employment rate. Then, X is restricted to a value between zero and one.
Suppose that the cumulative distribution function for X is given by \(F(x) = 3x^2 - 2x^3\) for \(0 \le x \le 1\).
Find the probability that the elderly employment rate is at least .6 (60%).


5 Just prior to jury selection for O. J. Simpson’s murder trial in 1995, a poll found that about 20% of the 
adult population believed Simpson was innocent (after much of the physical evidence in the case had 
been revealed to the public). Ignore the fact that this 20% is an estimate based on a subsample from the 
population; for illustration, take it as the true percentage of people who thought Simpson was innocent 
prior to jury selection. Assume that the 12 jurors were selected randomly and independently from the 
population (although this turned out not to be true). 

(i) Find the probability that the jury had at least one member who believed in Simpson’s innocence 
prior to jury selection. [Hint: Define the Binomial(12,.20) random variable X to be the number of 
jurors believing in Simpson’s innocence.]

(ii) Find the probability that the jury had at least two members who believed in Simpson’s innocence.
[Hint: P(X ≥ 2) = 1 - P(X ≤ 1) and P(X ≤ 1) = P(X = 0) + P(X = 1).]


6 (Requires calculus) Let X denote the prison sentence, in years, for people convicted of auto theft in a 
particular state in the United States. Suppose that the pdf of X is given by 


\[ f(x) = (1/9)x^2, \quad 0 < x < 3. \]
Use integration to find the expected prison sentence. 


7 If a basketball player is a 74% free throw shooter, then, on average, how many free throws will he or she
make in a game with eight free throw attempts? 


8 Suppose that a college student is taking three courses: a two-credit course, a three-credit course, and a 
four-credit course. The expected grade in the two-credit course is 3.5, while the expected grade in the 
three- and four-credit courses is 3.0. What is the expected overall grade point average for the semester? 
(Remember that each course grade is weighted by its share of the total number of units.) 


9 Let X denote the annual salary of university professors in the United States, measured in thousands of 
dollars. Suppose that the average salary is 52.3, with a standard deviation of 14.6. Find the mean and 
standard deviation when salary is measured in dollars. 


10 Suppose that at a large university, college grade point average, GPA, and SAT score, SAT, are related by 
the conditional expectation E(GPA|SAT) = .70 + .002 SAT. 
(i) Find the expected GPA when SAT = 800. Find E(GPA|SAT = 1,400). Comment on the 
difference. 
(ii) If the average SAT in the university is 1,100, what is the average GPA? (Hint: Use Property CE.4.) 
(iii) If a student’s SAT score is 1,100, does this mean he or she will have the GPA found in part (ii)? 
Explain. 


11 (i) Let X be a random variable taking on the values -1 and 1, each with probability 1/2. Find E(X)
and \(E(X^2)\).
(ii) Now let X be a random variable taking on the values 1 and 2, each with probability 1/2. Find
E(X) and E(1/X).
(iii) Conclude from parts (i) and (ii) that, in general,


\[ E[g(X)] \neq g[E(X)] \]
for a nonlinear function \(g(\cdot)\).
(iv) Given the definition of the F random variable in equation (B.43), show that


\[ E(F) = E\left[ \frac{1}{(X_2/k_2)} \right]. \]
Can you conclude that E(F) = 1?


12 


13 
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The geometric distribution can be used to model the number of trials before a certain event occurs. For 
example, we might flip a coin repeatedly until the first head appears. If the coin is fair, the probability 
of getting a head on each flip is 0. 5. Furthermore, we may realistically assume that the trials are inde- 
pendent. The flip on which the first head occurs can be represented by a random variable, X. 

For the general geometric distribution, we maintain the assumption of independent trails—which, 
admittedly, is sometimes too strong—but allow the probability of the event occurring on any trial to be 
0 for any 0 < 0 < 1. We assume that this probability is the same from trial to trial. In the coin-flipping 
example, allowing the coin to be biased toward, say, heads, would mean 0 > 0.5. Another example 
would be an unemployed worker repeatedly interviewing for jobs until the first job offer. Then @ is the 
probability of receiving an offer during any particular interview. To follow the geometric distribution, 
we would assume @ is the same for all interviews and that the outcomes are independent across inter- 
views. Both assumptions may be too strong. 

One way to characterize the geometric distribution is to define a sequence of Bernoulli (binary) 
variables, say W,, Wz, W3,.... If W, = 1 then the event occurs on trial k; if W, = 0, it does not occur. 
Assume that the W, are independent across k with the Bernoulli(@) distribution, so that P(W; = 1) = @. 
(i) Let X denote the trial upon which the first event occurs. The possible values of X are 

{1, 2, 3,...}. Show that for any positive integer k, 


PX = k) = (1 — 0)"9. 


[Hint: If X = k, you must observe k — 1 “failures” (zeros) followed by a “success” (one).] 
(ii) Use the formula for a geometric sum to show that 


PX=k =1- (1-05, k= 1,2,.... 


(iii) Suppose you have observed 29 failures in a row. If 0 = 0.04, what is the probability of observing 
a success on the 30” trial? 

(iv) In the setup of part (iii), before conducting any of the trials, what is the probability that the first 
success occurs before the 30” trial? 

(v) Reconcile your answers from parts (iii) and (iv). 

In March of 1985, the NCAA men’s basketball tournament increased its field of teams to 64. Since that 

time, each year of the tournament involves four games pitting a #1 seed against a #16 seed. The #1 

seeds are purportedly awarded to the four most deserving teams. The #16 teams are generally viewed 

as the weakest four teams in the field. In answering this question, we will make some simplifying 

assumptions to make the calculations easier. 

(i) Assume that the probability of a #16 seed beating a #1 seed is $p$, where $0 < p < 1$. (In practice,
$p$ varies by matchup, but we will assume it is the same across all matchups and years.) Assume
that the outcomes of #1 vs. #16 games are independent of one another. Show that the probability
that at least one #16 seed wins in a particular year is $1 - (1 - p)^4$. Evaluate this probability when
$p = 0.02$. [Hint: You might define four binary variables, say $Z_1$, $Z_2$, $Z_3$, and $Z_4$, where $Z_i = 1$ if
game $i$ is won by the #16 seed. Then first compute $P(Z_1 = 0, Z_2 = 0, Z_3 = 0, Z_4 = 0)$.]

(ii) Let $X$ be the number of years before a #16 beats a #1 seed in the tournament. Assuming independence
across years (a very reasonable assumption), explain why $X$ has a geometric distribution
and why the probability of “success” on a given trial is $\theta = 1 - (1 - p)^4$.

(iii) In the 2018 NCAA Men’s Tournament, #16 seed University of Maryland, Baltimore County
defeated #1 seed University of Virginia. It took 33 years for such an upset to occur. Suppose
$p = 0.02$. Find $P(X = 33)$. Interpret this probability using the perspective of a basketball observer
in February 1985.

(iv) Using p = 0.02, in February 2018 what was the probability that a #16 seed would defeat a #1 
seed in the March 2018 tournament? (It had not happened in the previous 32 years.) Why does 
this differ so much from your answer in part (iii)? 

(v) Derive the general formula 


$$P(X \le k) = 1 - (1 - p)^{4k}.$$
Fundamentals of Mathematical Statistics


C-1 Populations, Parameters, and Random Sampling 


Statistical inference involves learning something about a population given the availability of a sam- 
ple from that population. By population, we mean any well-defined group of subjects, which could 
be individuals, firms, cities, or many other possibilities. By “learning,” we can mean several things, 
which are broadly divided into the categories of estimation and hypothesis testing. 

A couple of examples may help you understand these terms. In the population of all working 
adults in the United States, labor economists are interested in learning about the return to education, 
as measured by the average percentage increase in earnings given another year of education. It would 
be impractical and costly to obtain information on earnings and education for the entire working 
population in the United States, but we can obtain data on a subset of the population. Using the data 
collected, a labor economist may report that his or her best estimate of the return to another year of 
education is 7.5%. This is an example of a point estimate. Or, she or he may report a range, such as 
“the return to education is between 5.6% and 9.4%.” This is an example of an interval estimate. 

An urban economist might want to know whether neighborhood crime watch programs are associ- 
ated with lower crime rates. After comparing crime rates of neighborhoods with and without such pro- 
grams in a sample from the population, he or she can draw one of two conclusions: neighborhood watch 
programs do affect crime, or they do not. This example falls under the rubric of hypothesis testing. 

The first step in statistical inference is to identify the population of interest. This may seem obvi- 
ous, but it is important to be very specific. Once we have identified the population, we can specify 
a model for the population relationship of interest. Such models involve probability distributions or 
features of probability distributions, and these depend on unknown parameters. Parameters are simply 
constants that determine the directions and strengths of relationships among variables. In the labor eco- 
nomics example just presented, the parameter of interest is the return to education in the population. 


C-1a Sampling 


For reviewing statistical inference, we focus on the simplest possible setting. Let $Y$ be a random variable
representing a population with a probability density function $f(y; \theta)$, which depends on the single
parameter $\theta$. The probability density function (pdf) of $Y$ is assumed to be known except for the value of $\theta$;
different values of $\theta$ imply different population distributions, and therefore we are interested in the value
of $\theta$. If we can obtain certain kinds of samples from the population, then we can learn something
about $\theta$. The easiest sampling scheme to deal with is random sampling.

Random Sampling. If $Y_1, Y_2, \ldots, Y_n$ are independent random variables with a common probability
density function $f(y; \theta)$, then $\{Y_1, \ldots, Y_n\}$ is said to be a random sample from $f(y; \theta)$ [or a
random sample from the population represented by $f(y; \theta)$].


When $\{Y_1, \ldots, Y_n\}$ is a random sample from the density $f(y; \theta)$, we also say that the $Y_i$ are independent,
identically distributed (or i.i.d.) random variables from $f(y; \theta)$. In some cases, we will not need
to entirely specify what the common distribution is.

The random nature of $Y_1, Y_2, \ldots, Y_n$ in the definition of random sampling reflects the fact that
many different outcomes are possible before the sampling is actually carried out. For example, if family
income is obtained for a sample of $n = 100$ families in the United States, the incomes we observe
will usually differ for each different sample of 100 families. Once a sample is obtained, we have a set
of numbers, say, $\{y_1, y_2, \ldots, y_n\}$, which constitute the data that we work with. Whether or not it is
appropriate to assume the sample came from a random sampling scheme requires knowledge about
the actual sampling process.

Random samples from a Bernoulli distribution are often used to illustrate statistical concepts, and
they also arise in empirical applications. If $Y_1, Y_2, \ldots, Y_n$ are independent random variables and each
is distributed as Bernoulli($\theta$), so that $P(Y_i = 1) = \theta$ and $P(Y_i = 0) = 1 - \theta$, then $\{Y_1, Y_2, \ldots, Y_n\}$
constitutes a random sample from the Bernoulli($\theta$) distribution. As an illustration, consider the airline
reservation example carried along in Math Refresher B. Each $Y_i$ denotes whether customer $i$ shows up
for his or her reservation; $Y_i = 1$ if passenger $i$ shows up, and $Y_i = 0$ otherwise. Here, $\theta$ is the probability
that a randomly drawn person from the population of all people who make airline reservations
shows up for his or her reservation.

For many other applications, random samples can be assumed to be drawn from a normal distribution.
If $\{Y_1, \ldots, Y_n\}$ is a random sample from the Normal($\mu$, $\sigma^2$) population, then the population
is characterized by two parameters, the mean $\mu$ and the variance $\sigma^2$. Primary interest usually lies in
$\mu$, but $\sigma^2$ is of interest in its own right because making inferences about $\mu$ often requires learning
about $\sigma^2$.


C-2 Finite Sample Properties of Estimators 


In this section, we study what are called finite sample properties of estimators. The term “finite 
sample” comes from the fact that the properties hold for a sample of any size, no matter how large 
or small. Sometimes, these are called small sample properties. In Section C-3, we cover “asymptotic
properties,” which have to do with the behavior of estimators as the sample size grows without bound.


C-2a Estimators and Estimates 


To study properties of estimators, we must define what we mean by an estimator. Given a random sample
$\{Y_1, Y_2, \ldots, Y_n\}$ drawn from a population distribution that depends on an unknown parameter $\theta$,
an estimator of $\theta$ is a rule that assigns each possible outcome of the sample a value of $\theta$. The rule is
specified before any sampling is carried out; in particular, the rule is the same regardless of the data
actually obtained.

As an example of an estimator, let $\{Y_1, \ldots, Y_n\}$ be a random sample from a population with
mean $\mu$. A natural estimator of $\mu$ is the average of the random sample:

$$\bar{Y} = n^{-1}\sum_{i=1}^{n} Y_i. \tag{C.1}$$

$\bar{Y}$ is called the sample average but, unlike in Math Refresher A where we defined the sample average
of a set of numbers as a descriptive statistic, $\bar{Y}$ is now viewed as an estimator. Given any outcome of the
random variables $Y_1, \ldots, Y_n$, we use the same rule to estimate $\mu$: we simply average them. For actual
data outcomes $\{y_1, \ldots, y_n\}$, the estimate is just the average in the sample: $\bar{y} = (y_1 + y_2 + \cdots + y_n)/n$.
City Unemployment Rates

Suppose we obtain the following sample of unemployment rates for 10 cities in the United States:

City    Unemployment Rate
1       5.1
2       6.4
3       9.2
4       4.1
5       7.5
6       8.3
7       2.6
8       3.5
9       5.8
10      7.5

Our estimate of the average city unemployment rate in the United States is $\bar{y} = 6.0$. Each sample generally
results in a different estimate. But the rule for obtaining the estimate is the same, regardless of
which cities appear in the sample, or how many. 
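
As a quick check on the arithmetic, the estimate can be reproduced in a few lines of Python. This is only an illustrative sketch: the function name `sample_average` is our own, and the data are the ten unemployment rates from the example.

```python
# Unemployment rates for the 10 sampled cities in the example.
rates = [5.1, 6.4, 9.2, 4.1, 7.5, 8.3, 2.6, 3.5, 5.8, 7.5]


def sample_average(values):
    """Apply the rule in (C.1): add up the outcomes and divide by n."""
    return sum(values) / len(values)


y_bar = sample_average(rates)
print(round(y_bar, 1))
```

The same function applies unchanged to any other sample of cities, of any size, which is the sense in which the estimator is a rule rather than a number.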


More generally, an estimator $W$ of a parameter $\theta$ can be expressed as an abstract mathematical
formula:

$$W = h(Y_1, Y_2, \ldots, Y_n), \tag{C.2}$$

for some known function $h$ of the random variables $Y_1, Y_2, \ldots, Y_n$. As with the special case of the
sample average, $W$ is a random variable because it depends on the random sample: as we obtain
different random samples from the population, the value of $W$ can change. When a particular set of
numbers, say, $\{y_1, y_2, \ldots, y_n\}$, is plugged into the function $h$, we obtain an estimate of $\theta$, denoted
$w = h(y_1, \ldots, y_n)$. Sometimes, $W$ is called a point estimator and $w$ a point estimate to distinguish
these from interval estimators and estimates, which we will come to in Section C-5.

For evaluating estimation procedures, we study various properties of the probability distribution 
of the random variable W. The distribution of an estimator is often called its sampling distribution, 
because this distribution describes the likelihood of various outcomes of W across different random 
samples. Because there are unlimited rules for combining data to estimate parameters, we need some 
sensible criteria for choosing among estimators, or at least for eliminating some estimators from con- 
sideration. Therefore, we must leave the realm of descriptive statistics, where we compute things such 
as the sample average to simply summarize a body of data. In mathematical statistics, we study the 
sampling distributions of estimators. 


C-2b Unbiasedness 


In principle, the entire sampling distribution of $W$ can be obtained given the probability distribution of
$Y_i$ and the function $h$. It is usually easier to focus on a few features of the distribution of $W$ in evaluating
it as an estimator of $\theta$. The first important property of an estimator involves its expected value.

Unbiased Estimator. An estimator, $W$ of $\theta$, is an unbiased estimator if

$$E(W) = \theta, \tag{C.3}$$

for all possible values of $\theta$.
If an estimator is unbiased, then its probability distribution has an expected value equal to the
parameter it is supposed to be estimating. Unbiasedness does not mean that the estimate we get with
any particular sample is equal to $\theta$, or even very close to $\theta$. Rather, if we could indefinitely draw
random samples on $Y$ from the population, compute an estimate each time, and then average these
estimates over all random samples, we would obtain $\theta$. This thought experiment is abstract because,
in most applications, we just have one random sample to work with.

For an estimator that is not unbiased, we define its bias as follows.

Bias of an Estimator. If $W$ is a biased estimator of $\theta$, its bias is defined as

$$\mathrm{Bias}(W) = E(W) - \theta. \tag{C.4}$$


Figure C.1 shows two estimators; the first one is unbiased, and the second one has a positive bias. 

The unbiasedness of an estimator and the size of any possible bias depend on the distribution of Y 
and on the function h. The distribution of Y is usually beyond our control (although we often choose a 
model for this distribution): it may be determined by nature or social forces. But the choice of the rule 
h is ours, and if we want an unbiased estimator, then we must choose h accordingly. 

Some estimators can be shown to be unbiased quite generally. We now show that the sample
average $\bar{Y}$ is an unbiased estimator of the population mean $\mu$, regardless of the underlying population
distribution. We use the properties of expected values (E.1 and E.2) that we covered in Section B-3:

$$
\begin{aligned}
E(\bar{Y}) &= E\left((1/n)\sum_{i=1}^{n} Y_i\right) = (1/n)\,E\left(\sum_{i=1}^{n} Y_i\right) = (1/n)\left(\sum_{i=1}^{n} E(Y_i)\right) \\
&= (1/n)\left(\sum_{i=1}^{n} \mu\right) = (1/n)(n\mu) = \mu.
\end{aligned}
$$


FIGURE C.1 An unbiased estimator, $W_1$, and an estimator with positive bias, $W_2$.
For hypothesis testing, we will need to estimate the variance $\sigma^2$ from a population with mean $\mu$.
Letting $\{Y_1, \ldots, Y_n\}$ denote the random sample from the population with $E(Y) = \mu$ and $\mathrm{Var}(Y) = \sigma^2$,
define the estimator as

$$S^2 = \frac{1}{n-1}\sum_{i=1}^{n} (Y_i - \bar{Y})^2, \tag{C.5}$$

which is usually called the sample variance. It can be shown that $S^2$ is unbiased for $\sigma^2$: $E(S^2) = \sigma^2$.
The division by $n - 1$, rather than $n$, accounts for the fact that the mean $\mu$ is estimated rather than
known. If $\mu$ were known, an unbiased estimator of $\sigma^2$ would be $n^{-1}\sum_{i=1}^{n}(Y_i - \mu)^2$, but $\mu$ is rarely
known in practice. 
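
The claim that dividing by $n - 1$ removes the bias can be checked informally by simulation. The following Python sketch assumes NumPy is available and uses an arbitrary Normal(1, 4) population, seed, and sample size, none of which come from the text; it averages both versions of the variance estimator over many random samples.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the sketch is reproducible
n, reps = 10, 100_000            # small samples, many replications
sigma2 = 4.0                     # assumed population variance (mean set to 1.0)

# Each row is one random sample of size n from Normal(1, sigma2).
samples = rng.normal(loc=1.0, scale=np.sqrt(sigma2), size=(reps, n))

s2_unbiased = samples.var(axis=1, ddof=1)  # divides by n - 1, as in (C.5)
s2_divide_n = samples.var(axis=1, ddof=0)  # divides by n instead

mean_unbiased = s2_unbiased.mean()   # average of S^2 across samples
mean_divide_n = s2_divide_n.mean()   # average of the divide-by-n version
print(mean_unbiased, mean_divide_n)
```

With these settings the first average should land near $\sigma^2 = 4$, while the second should land near $4 \cdot (9/10) = 3.6$, the downward bias that comes from dividing by $n$.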

Although unbiasedness has a certain appeal as a property for an estimator—indeed, its antonym, 
“biased,” has decidedly negative connotations—it is not without its problems. One weakness of unbi- 
asedness is that some reasonable, and even some very good, estimators are not unbiased. We will see 
an example shortly. 

Another important weakness of unbiasedness is that unbiased estimators exist that are actually
quite poor estimators. Consider estimating the mean $\mu$ from a population. Rather than using the sample
average $\bar{Y}$ to estimate $\mu$, suppose that, after collecting a sample of size $n$, we discard all of the observations
except the first. That is, our estimator of $\mu$ is simply $W = Y_1$. This estimator is unbiased because
$E(Y_1) = \mu$. Hopefully, you sense that ignoring all but the first observation is not a prudent approach to
estimation: it throws out most of the information in the sample. For example, with $n = 100$, we obtain
100 outcomes of the random variable $Y$, but then we use only the first of these to estimate $E(Y)$.


C-2c The Sampling Variance of Estimators 


The example at the end of the previous subsection shows that we need additional criteria to evaluate
estimators. Unbiasedness only ensures that the sampling distribution of an estimator has a mean value
equal to the parameter it is supposed to be estimating. This is fine, but we also need to know how
spread out the distribution of an estimator is. An estimator can be equal to $\theta$, on average, but it can also
be very far away with large probability. In Figure C.2, $W_1$ and $W_2$ are both unbiased estimators of $\theta$.
But the distribution of $W_1$ is more tightly centered about $\theta$: the probability that $W_1$ is greater than any
given distance from $\theta$ is less than the probability that $W_2$ is greater than that same distance from $\theta$.
Using $W_1$ as our estimator means that it is less likely that we will obtain a random sample that yields
an estimate very far from $\theta$.

To summarize the situation shown in Figure C.2, we rely on the variance (or standard deviation) 
of an estimator. Recall that this gives a single measure of the dispersion in the distribution. The vari- 
ance of an estimator is often called its sampling variance because it is the variance associated with a 
sampling distribution. Remember, the sampling variance is not a random variable; it is a constant, but 
it might be unknown. 

We now obtain the variance of the sample average for estimating the mean $\mu$ from a population:

$$
\begin{aligned}
\mathrm{Var}(\bar{Y}) &= \mathrm{Var}\left((1/n)\sum_{i=1}^{n} Y_i\right) = (1/n^2)\,\mathrm{Var}\left(\sum_{i=1}^{n} Y_i\right) = (1/n^2)\left(\sum_{i=1}^{n} \mathrm{Var}(Y_i)\right) \\
&= (1/n^2)\left(\sum_{i=1}^{n} \sigma^2\right) = (1/n^2)(n\sigma^2) = \sigma^2/n.
\end{aligned}
\tag{C.6}
$$

Notice how we used the properties of variance from Sections B-3 and B-4 (VAR.2 and VAR.4), as
well as the independence of the $Y_i$. To summarize: If $\{Y_i: i = 1, 2, \ldots, n\}$ is a random sample from a
population with mean $\mu$ and variance $\sigma^2$, then $\bar{Y}$ has the same mean as the population, but its sampling
variance equals the population variance, $\sigma^2$, divided by the sample size.
FIGURE C.2 The sampling distributions of two unbiased estimators of $\theta$.


An important implication of $\mathrm{Var}(\bar{Y}) = \sigma^2/n$ is that it can be made very close to zero by increasing
the sample size $n$. This is a key feature of a reasonable estimator, and we return to it in Section C-3.

As suggested by Figure C.2, among unbiased estimators, we prefer the estimator with the smallest
variance. This allows us to eliminate certain estimators from consideration. For a random sample
from a population with mean $\mu$ and variance $\sigma^2$, we know that $\bar{Y}$ is unbiased and $\mathrm{Var}(\bar{Y}) = \sigma^2/n$.
What about the estimator $Y_1$, which is just the first observation drawn? Because $Y_1$ is a random draw
from the population, $\mathrm{Var}(Y_1) = \sigma^2$. Thus, the difference between $\mathrm{Var}(Y_1)$ and $\mathrm{Var}(\bar{Y})$ can be large
even for small sample sizes. If $n = 10$, then $\mathrm{Var}(Y_1)$ is 10 times as large as $\mathrm{Var}(\bar{Y}) = \sigma^2/10$. This
gives us a formal way of excluding $Y_1$ as an estimator of $\mu$.

To emphasize this point, Table C.1 contains the outcome of a small simulation study. Using the
statistical package Stata®, 20 random samples of size 10 were generated from a normal distribution,
with $\mu = 2$ and $\sigma^2 = 1$; we are interested in estimating $\mu$ here. For each of the 20 random samples,
we compute two estimates, $y_1$ and $\bar{y}$; these values are listed in Table C.1. As can be seen from the
table, the values for $y_1$ are much more spread out than those for $\bar{y}$: $y_1$ ranges from $-0.64$ to $4.27$, while
$\bar{y}$ ranges only from $1.16$ to $2.58$. Further, in 16 out of 20 cases, $\bar{y}$ is closer than $y_1$ to $\mu = 2$. The average
of $y_1$ across the simulations is about 1.89, while that for $\bar{y}$ is 1.96. The fact that these averages are
close to 2 illustrates the unbiasedness of both estimators (and we could get these averages closer to 2
by doing more than 20 replications). But comparing just the average outcomes across random draws
masks the fact that the sample average $\bar{Y}$ is far superior to $Y_1$ as an estimator of $\mu$.
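
Table C.1 was produced in Stata, but the same experiment is easy to mimic elsewhere. The sketch below uses Python with NumPy instead (our substitution, not the setup behind the table), with an arbitrary seed and many more replications, so its numbers will not match Table C.1.

```python
import numpy as np

rng = np.random.default_rng(123)  # arbitrary seed (an assumption)
mu, sigma, n = 2.0, 1.0, 10       # Normal(2, 1) population, samples of size 10
reps = 5_000                      # far more than the 20 replications in Table C.1

# Each row is one random sample of size n.
samples = rng.normal(mu, sigma, size=(reps, n))
y1 = samples[:, 0]             # estimator that keeps only the first observation
y_bar = samples.mean(axis=1)   # sample average

# Both averages should be near mu = 2, illustrating unbiasedness.
avg_y1, avg_y_bar = y1.mean(), y_bar.mean()

# The sampling variances should be near sigma^2 = 1 and sigma^2 / n = 0.1.
var_y1, var_y_bar = y1.var(ddof=1), y_bar.var(ddof=1)

# Fraction of replications in which the sample average is closer to mu.
share_closer = np.mean(np.abs(y_bar - mu) < np.abs(y1 - mu))
print(avg_y1, avg_y_bar, var_y1, var_y_bar, share_closer)
```

The two averages are similar, but the simulated variance of the first-observation estimator should be roughly ten times that of the sample average, in line with $\mathrm{Var}(Y_1) = \sigma^2$ and $\mathrm{Var}(\bar{Y}) = \sigma^2/10$.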


C-2d Efficiency 


Comparing the variances of $\bar{Y}$ and $Y_1$ in the previous subsection is an example of a general approach
to comparing different unbiased estimators.

Relative Efficiency. If $W_1$ and $W_2$ are two unbiased estimators of $\theta$, $W_1$ is efficient relative to
$W_2$ when $\mathrm{Var}(W_1) \le \mathrm{Var}(W_2)$ for all $\theta$, with strict inequality for at least one value of $\theta$.
TABLE C.1 Simulation of Estimators for a Normal($\mu$, 1) Distribution with $\mu = 2$


Replication $y_1$ $\bar{y}$
1 -0.64 1.98
2 1.06 1.43 
3 4.27 1.65 
4 1.03 1.88 
5 3.16 2.34 
6 2.77 2.58 
7 1.68 1.58 
8 2.98 2.23 
9 2.25 1.96 
10 2.04 2.11 
11 0.95 2.15 
12 1.36 1.93 
13 2.62 2.02 
14 2.97 2.10 
15 1.93 2.18 
16 1.14 2.10 
17 2.08 1.94 
18 1.52 2.21 
19 1.33 1.16 
20 1.21 1.75 


Earlier, we showed that, for estimating the population mean $\mu$, $\mathrm{Var}(\bar{Y}) < \mathrm{Var}(Y_1)$ for any value
of $\sigma^2$ whenever $n > 1$. Thus, $\bar{Y}$ is efficient relative to $Y_1$ for estimating $\mu$. We cannot always choose
between unbiased estimators based on the smallest variance criterion: given two unbiased estimators
of $\theta$, one can have smaller variance for some values of $\theta$, while the other can have smaller variance
for other values of $\theta$.

If we restrict our attention to a certain class of estimators, we can show that the sample average
has the smallest variance. Problem C.2 asks you to show that $\bar{Y}$ has the smallest variance among all
unbiased estimators that are also linear functions of $Y_1, Y_2, \ldots, Y_n$. The assumptions are that the $Y_i$
have common mean and variance, and that they are pairwise uncorrelated.

If we do not restrict our attention to unbiased estimators, then comparing variances is meaningless.
For example, when estimating the population mean $\mu$, we can use a trivial estimator that is
equal to zero, regardless of the sample that we draw. Naturally, the variance of this estimator is zero
(because it is the same value for every random sample). But the bias of this estimator is $-\mu$, so it is a
very poor estimator when $|\mu|$ is large.

One way to compare estimators that are not necessarily unbiased is to compute the
mean squared error (MSE) of the estimators. If $W$ is an estimator of $\theta$, then the MSE of $W$ is
defined as $\mathrm{MSE}(W) = E[(W - \theta)^2]$. The MSE measures how far, on average, the estimator is away
from $\theta$. It can be shown that $\mathrm{MSE}(W) = \mathrm{Var}(W) + [\mathrm{Bias}(W)]^2$, so that $\mathrm{MSE}(W)$ depends on the
variance and bias (if any is present). This allows us to compare two estimators when one or both are 
biased. 
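
The decomposition of the MSE into variance and squared bias follows from adding and subtracting $E(W)$ inside the square. A short derivation, using only the linearity of expected values and the fact that $E(W) - \theta$ is a constant, is

$$
\begin{aligned}
\mathrm{MSE}(W) &= E[(W - \theta)^2] = E\{[(W - E(W)) + (E(W) - \theta)]^2\} \\
&= E[(W - E(W))^2] + 2\,(E(W) - \theta)\,E[W - E(W)] + (E(W) - \theta)^2 \\
&= \mathrm{Var}(W) + [\mathrm{Bias}(W)]^2,
\end{aligned}
$$

where the middle term drops out because $E[W - E(W)] = 0$.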
C-3 Asymptotic or Large Sample Properties of Estimators


In Section C-2, we encountered the estimator $Y_1$ for the population mean $\mu$, and we saw that, even
though it is unbiased, it is a poor estimator because its variance can be much larger than that of the
sample mean. One notable feature of $Y_1$ is that it has the same variance for any sample size. It seems
reasonable to require any estimation procedure to improve as the sample size increases. For estimating
a population mean $\mu$, $\bar{Y}$ improves in the sense that its variance gets smaller as $n$ gets larger; $Y_1$
does not improve in this sense.

We can rule out certain silly estimators by studying the asymptotic or large sample properties 
of estimators. In addition, we can say something positive about estimators that are not unbiased and 
whose variances are not easily found. 

Asymptotic analysis involves approximating the features of the sampling distribution of an esti- 
mator. These approximations depend on the size of the sample. Unfortunately, we are necessarily 
limited in what we can say about how “large” a sample size is needed for asymptotic analysis to be 
appropriate; this depends on the underlying population distribution. But large sample approximations 
have been known to work well for sample sizes as small as n = 20. 


C-3a Consistency 


The first asymptotic property of estimators concerns how far the estimator is likely to be from the 
parameter it is supposed to be estimating as we let the sample size increase indefinitely. 


Consistency. Let $W_n$ be an estimator of $\theta$ based on a sample $Y_1, Y_2, \ldots, Y_n$ of size $n$. Then, $W_n$
is a consistent estimator of $\theta$ if for every $\varepsilon > 0$,

$$P(|W_n - \theta| > \varepsilon) \to 0 \text{ as } n \to \infty. \tag{C.7}$$

If $W_n$ is not consistent for $\theta$, then we say it is inconsistent.

When $W_n$ is consistent, we also say that $\theta$ is the probability limit of $W_n$, written
as $\mathrm{plim}(W_n) = \theta$.

Unlike unbiasedness—which is a feature of an estimator for a given sample size—consistency 
involves the behavior of the sampling distribution of the estimator as the sample size n gets large. To 
emphasize this, we have indexed the estimator by the sample size in stating this definition, and we 
will continue with this convention throughout this section. 

Equation (C.7) looks technical, and it can be rather difficult to establish based on fundamental
probability principles. By contrast, interpreting (C.7) is straightforward. It means that the distribution
of $W_n$ becomes more and more concentrated about $\theta$, which roughly means that for larger sample
sizes, $W_n$ is less and less likely to be very far from $\theta$. This tendency is illustrated in Figure C.3.

If an estimator is not consistent, then it does not help us to learn about $\theta$, even with an unlimited
amount of data. For this reason, consistency is a minimal requirement of an estimator used in statistics
or econometrics. We will encounter estimators that are consistent under certain assumptions and
inconsistent when those assumptions fail. When estimators are inconsistent, we can usually find their
probability limits, and it will be important to know how far these probability limits are from $\theta$.

As we noted earlier, unbiased estimators are not necessarily consistent, but those whose variances
shrink to zero as the sample size grows are consistent. This can be stated formally: If $W_n$ is an
unbiased estimator of $\theta$ and $\mathrm{Var}(W_n) \to 0$ as $n \to \infty$, then $\mathrm{plim}(W_n) = \theta$. Unbiased estimators that
use the entire data sample will usually have a variance that shrinks to zero as the sample size grows,
thereby being consistent.

A good example of a consistent estimator is the average of a random sample drawn from a population
with mean $\mu$ and variance $\sigma^2$. We have already shown that the sample average is unbiased for $\mu$.
FIGURE C.3 The sampling distributions of a consistent estimator for three sample sizes.


In Equation (C.6), we derived $\mathrm{Var}(\bar{Y}_n) = \sigma^2/n$ for any sample size $n$. Therefore, $\mathrm{Var}(\bar{Y}_n) \to 0$ as
$n \to \infty$, so $\bar{Y}_n$ is a consistent estimator of $\mu$ (in addition to being unbiased).

The conclusion that $\bar{Y}_n$ is consistent for $\mu$ holds even if $\mathrm{Var}(\bar{Y}_n)$ does not exist. This classic result
is known as the law of large numbers (LLN).

Law of Large Numbers. Let $Y_1, Y_2, \ldots, Y_n$ be independent, identically distributed random
variables with mean $\mu$. Then,

$$\mathrm{plim}(\bar{Y}_n) = \mu. \tag{C.8}$$

The law of large numbers means that, if we are interested in estimating the population average $\mu$, we
can get arbitrarily close to $\mu$ by choosing a sufficiently large sample. This fundamental result can be
combined with basic properties of plims to show that fairly complicated estimators are consistent. 
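
The law of large numbers is easy to see in a simulated Bernoulli sample. The Python sketch below assumes NumPy and uses a hypothetical show-up probability of 0.85 and an arbitrary seed (both are our choices for illustration); it tracks the running sample average as more observations are added.

```python
import numpy as np

rng = np.random.default_rng(42)   # arbitrary seed (an assumption)
theta = 0.85                      # hypothetical show-up probability
draws = rng.binomial(1, theta, size=100_000)  # i.i.d. Bernoulli(theta) outcomes

# Running sample average after each of the first n draws.
running_mean = np.cumsum(draws) / np.arange(1, draws.size + 1)

for n in (10, 100, 1_000, 10_000, 100_000):
    print(n, running_mean[n - 1])

# Distance between the final sample average and the population mean.
final_error = abs(running_mean[-1] - theta)
```

For small $n$ the running average can be well away from 0.85, but it should settle close to the population mean as $n$ grows, which is what (C.8) describes.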


Property PLIM.1: Let $\theta$ be a parameter and define a new parameter, $\gamma = g(\theta)$, for some
continuous function $g(\theta)$. Suppose that $\mathrm{plim}(W_n) = \theta$. Define an estimator of $\gamma$ by $G_n = g(W_n)$.
Then,

$$\mathrm{plim}(G_n) = \gamma. \tag{C.9}$$

This is often stated as

$$\mathrm{plim}\, g(W_n) = g(\mathrm{plim}\, W_n) \tag{C.10}$$

for a continuous function $g(\theta)$.

The assumption that $g(\theta)$ is continuous is a technical requirement that has often been described
nontechnically as “a function that can be graphed without lifting your pencil from the paper.” Because
all the functions we encounter in this text are continuous, we do not provide a formal definition of
a continuous function. Examples of continuous functions are $g(\theta) = a + b\theta$ for constants $a$ and $b$,
$g(\theta) = \theta^2$, $g(\theta) = 1/\theta$, $g(\theta) = \sqrt{\theta}$, $g(\theta) = \exp(\theta)$, and many variants on these. We will not need
to mention the continuity assumption again.
As an important example of a consistent but biased estimator, consider estimating the standard
deviation, $\sigma$, from a population with mean $\mu$ and variance $\sigma^2$. We already claimed that the sample
variance $S_n^2 = (n - 1)^{-1}\sum_{i=1}^{n}(Y_i - \bar{Y}_n)^2$ is unbiased for $\sigma^2$. Using the law of large numbers and
some algebra, $S_n^2$ can also be shown to be consistent for $\sigma^2$. The natural estimator of $\sigma = \sqrt{\sigma^2}$
is $S_n = \sqrt{S_n^2}$ (where the square root is always the positive square root). $S_n$, which is called the
sample standard deviation, is not an unbiased estimator because the expected value of the
square root is not the square root of the expected value (see Section B-3). Nevertheless, by PLIM.1,
$\mathrm{plim}\, S_n = \sqrt{\mathrm{plim}\, S_n^2} = \sqrt{\sigma^2} = \sigma$, so $S_n$ is a consistent estimator of $\sigma$.
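
A small simulation can make the distinction between bias and consistency concrete for the sample standard deviation. The sketch below assumes NumPy, a normal population with $\sigma = 2$, and an arbitrary seed (all our choices); the helper name `summarize_s` is hypothetical. It reports the average and the spread of the sample standard deviation for a small and a large sample size.

```python
import numpy as np

rng = np.random.default_rng(7)  # arbitrary seed (an assumption)
sigma = 2.0                     # assumed population standard deviation
reps = 5_000                    # number of simulated random samples


def summarize_s(n):
    """Mean and spread of the sample standard deviation over many samples of size n."""
    samples = rng.normal(0.0, sigma, size=(reps, n))
    s = samples.std(axis=1, ddof=1)  # square root of the unbiased sample variance
    return s.mean(), s.std()


mean_small, spread_small = summarize_s(5)    # small n: average falls below sigma
mean_large, spread_large = summarize_s(500)  # large n: tightly concentrated near sigma
print(mean_small, spread_small)
print(mean_large, spread_large)
```

With $n = 5$ the average should fall visibly below 2, reflecting the bias. With $n = 500$ both the bias and the spread should be small, consistent with $\mathrm{plim}\, S_n = \sigma$.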


Here are some other useful properties of the probability limit: 


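Property PLIM.2: If $\mathrm{plim}(T_n) = \alpha$ and $\mathrm{plim}(U_n) = \beta$, then

(i) $\mathrm{plim}(T_n + U_n) = \alpha + \beta$;
(ii) $\mathrm{plim}(T_n U_n) = \alpha\beta$;
(iii) $\mathrm{plim}(T_n/U_n) = \alpha/\beta$, provided $\beta \neq 0$.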


These three facts about probability limits allow us to combine consistent estimators in a variety of
ways to get other consistent estimators. For example, let $\{Y_1, \ldots, Y_n\}$ be a random sample of size
$n$ on annual earnings from the population of workers with a high school education and denote the
population mean by $\mu_Y$. Let $\{Z_1, \ldots, Z_n\}$ be a random sample on annual earnings from the population
of workers with a college education and denote the population mean by $\mu_Z$. We wish to estimate the
percentage difference in annual earnings between the two groups, which is $\gamma = 100 \cdot (\mu_Z - \mu_Y)/\mu_Y$.
(This is the percentage by which average earnings for college graduates differs from average earnings
for high school graduates.) Because $\bar{Y}_n$ is consistent for $\mu_Y$ and $\bar{Z}_n$ is consistent for $\mu_Z$, it follows from
PLIM.1 and part (iii) of PLIM.2 that

$$G_n = 100 \cdot (\bar{Z}_n - \bar{Y}_n)/\bar{Y}_n$$

is a consistent estimator of $\gamma$. $G_n$ is just the percentage difference between $\bar{Z}_n$ and $\bar{Y}_n$ in the sample,
so it is a natural estimator. $G_n$ is not an unbiased estimator of $\gamma$, but it is still a good estimator except
possibly when n is small. 
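
To see how $G_n$ behaves as the sample size grows, one can simulate two earnings populations. Everything specific in the following Python sketch is an assumption made for illustration: the population means (40,000 and 55,000), the skewed gamma distributions used to generate earnings, the seed, and the helper name `g_n`. It assumes NumPy is available.

```python
import numpy as np

rng = np.random.default_rng(2024)   # arbitrary seed (an assumption)
mu_y, mu_z = 40_000.0, 55_000.0     # hypothetical population mean earnings
gamma = 100 * (mu_z - mu_y) / mu_y  # population percentage difference


def g_n(n):
    """Percentage difference in sample means from two samples of size n."""
    # Gamma distributions with shape 4 are an assumed, right-skewed earnings model;
    # shape * scale equals the population mean.
    y = rng.gamma(shape=4.0, scale=mu_y / 4.0, size=n)
    z = rng.gamma(shape=4.0, scale=mu_z / 4.0, size=n)
    return 100 * (z.mean() - y.mean()) / y.mean()


print("population value:", gamma)
for n in (10, 1_000, 100_000):
    print(n, g_n(n))
```

With only 10 observations per group the estimate can be far from the population value of 37.5, but for large $n$ it should settle close to it, as consistency implies.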


C-3b Asymptotic Normality 


Consistency is a property of point estimators. Although it does tell us that the distribution of the esti- 
mator is collapsing around the parameter as the sample size gets large, it tells us essentially nothing 
about the shape of that distribution for a given sample size. For constructing interval estimators and 
testing hypotheses, we need a way to approximate the distribution of our estimators. Most econo- 
metric estimators have distributions that are well approximated by a normal distribution for large 
samples, which motivates the following definition. 


Asymptotic Normality. Let $\{Z_n: n = 1, 2, \ldots\}$ be a sequence of random variables, such that
for all numbers $z$,

$$P(Z_n \le z) \to \Phi(z) \text{ as } n \to \infty, \tag{C.11}$$

where $\Phi(z)$ is the standard normal cumulative distribution function. Then, $Z_n$ is said to have an
asymptotic standard normal distribution. In this case, we often write $Z_n \overset{a}{\sim} \mathrm{Normal}(0, 1)$. (The “a”
above the tilde stands for “asymptotically” or “approximately.”)

Property (C.11) means that the cumulative distribution function for $Z_n$ gets closer and closer to the
cdf of the standard normal distribution as the sample size $n$ gets large. When asymptotic normality
holds, for large $n$ we have the approximation $P(Z_n \le z) \approx \Phi(z)$. Thus, probabilities concerning $Z_n$
can be approximated by standard normal probabilities.
The central limit theorem (CLT) is one of the most powerful results in probability and statistics.
It states that the average from a random sample for any population (with finite variance), when
standardized, has an asymptotic standard normal distribution.

Central Limit Theorem. Let $\{Y_1, Y_2, \ldots, Y_n\}$ be a random sample with mean $\mu$ and variance $\sigma^2$.
Then,

$$Z_n = \frac{\bar{Y}_n - \mu}{\sigma/\sqrt{n}} = \frac{\sqrt{n}(\bar{Y}_n - \mu)}{\sigma} \tag{C.12}$$

has an asymptotic standard normal distribution.

The variable $Z_n$ in (C.12) is the standardized version of $\bar{Y}_n$: we have subtracted off $E(\bar{Y}_n) = \mu$ and
divided by $\mathrm{sd}(\bar{Y}_n) = \sigma/\sqrt{n}$. Thus, regardless of the population distribution of $Y$, $Z_n$ has mean zero
and variance one, which coincides with the mean and variance of the standard normal distribution.
Remarkably, the entire distribution of $Z_n$ gets arbitrarily close to the standard normal distribution as $n$
gets large.

The second equality in equation (C.12) expresses the standardized variable as $\sqrt{n}(\bar{Y}_n - \mu)/\sigma$,
which shows that we must multiply the difference between the sample mean and the population mean
by the square root of the sample size in order to obtain a useful limiting distribution. Without the
multiplication by $\sqrt{n}$, we would just have $(\bar{Y}_n - \mu)/\sigma$, which converges in probability to zero. In
other words, the distribution of $(\bar{Y}_n - \mu)/\sigma$ simply collapses to a single point as $n \to \infty$, which we
know cannot be a good approximation to the distribution of $(\bar{Y}_n - \mu)/\sigma$ for reasonable sample sizes.
Multiplying by $\sqrt{n}$ ensures that the variance of $Z_n$ remains constant. Practically, we often treat $\bar{Y}_n$ as
being approximately normally distributed with mean $\mu$ and variance $\sigma^2/n$, and this gives us the correct
statistical procedures because it leads to the standardized variable in equation (C.12).

Most estimators encountered in statistics and econometrics can be written as functions of sample
averages, in which case we can apply the law of large numbers and the central limit theorem. When
two consistent estimators have asymptotic normal distributions, we choose the estimator with the
smallest asymptotic variance.

In addition to the standardized sample average in (C.12), many other statistics that depend on
sample averages turn out to be asymptotically normal. An important one is obtained by replacing $\sigma$
with its consistent estimator $S_n$ in equation (C.12):

$$\frac{\bar{Y}_n - \mu}{S_n/\sqrt{n}} \tag{C.13}$$

also has an approximate standard normal distribution for large $n$. The exact (finite sample) distributions
of (C.12) and (C.13) are definitely not the same, but the difference is often small enough to be
ignored for large n. 
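
The central limit theorem can be illustrated with a population that is far from normal. The Python sketch below assumes NumPy and draws from an Exponential(1) population, which has mean 1 and variance 1; the population, sample size, seed, and the helper name `std_normal_cdf` are all our assumptions for illustration. It compares simulated probabilities for the standardized average in (C.12) with standard normal probabilities.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(1)  # arbitrary seed (an assumption)
n, reps = 200, 20_000           # sample size and number of simulated samples
mu, sigma = 1.0, 1.0            # mean and standard deviation of Exponential(1)

# Each row is one random sample of size n from a skewed population.
samples = rng.exponential(scale=1.0, size=(reps, n))
z = np.sqrt(n) * (samples.mean(axis=1) - mu) / sigma  # standardized average, as in (C.12)


def std_normal_cdf(x):
    """Standard normal cdf, Phi(x), written in terms of the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


for point in (-1.96, 0.0, 1.96):
    empirical = np.mean(z <= point)  # simulated P(Z_n <= point)
    print(point, empirical, std_normal_cdf(point))
```

Even though the exponential distribution is heavily skewed, the simulated probabilities should be close to the standard normal values for a sample size of this order. Small discrepancies remain because the approximation is only asymptotic.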

Throughout this section, each estimator has been subscripted by n to emphasize the nature of 
asymptotic or large sample analysis. Continuing this convention clutters the notation without providing 
additional insight, once the fundamentals of asymptotic analysis are understood. Henceforth, we drop 
the n subscript and rely on you to remember that estimators depend on the sample size, and properties 
such as consistency and asymptotic normality refer to the growth of the sample size without bound. 


C-4 General Approaches to Parameter Estimation 


Until this point, we have used the sample average to illustrate the finite and large sample properties 
of estimators. It is natural to ask: Are there general approaches to estimation that produce estimators 
with good properties, such as unbiasedness, consistency, and efficiency? 

The answer is yes. A detailed treatment of various approaches to estimation is beyond the scope 
of this text; here, we provide only an informal discussion. A thorough discussion is given in Larsen 
and Marx (1986, Chapter 5).


C-4a Method of Moments


Given a parameter $\theta$ appearing in a population distribution, there are usually many ways to obtain
unbiased and consistent estimators of $\theta$. Trying all different possibilities and comparing them on the
basis of the criteria in Sections C-2 and C-3 is not practical. Fortunately, some methods have been shown
to have good general properties, and, for the most part, the logic behind them is intuitively appealing.

In the previous sections, we have studied the sample average as an unbiased estimator of the
population average and the sample variance as an unbiased estimator of the population variance. These
estimators are examples of method of moments estimators. Generally, method of moments estimation
proceeds as follows. The parameter $\theta$ is shown to be related to some expected value in the distribution
of $Y$, usually $E(Y)$ or $E(Y^2)$ (although more exotic choices are sometimes used). Suppose, for example,
that the parameter of interest, $\theta$, is related to the population mean as $\theta = g(\mu)$ for some function $g$.
Because the sample average $\bar{Y}$ is an unbiased and consistent estimator of $\mu$, it is natural to replace $\mu$
with $\bar{Y}$, which gives us the estimator $g(\bar{Y})$ of $\theta$. The estimator $g(\bar{Y})$ is consistent for $\theta$, and if $g(\mu)$ is
a linear function of $\mu$, then $g(\bar{Y})$ is unbiased as well. What we have done is replace the population
moment, $\mu$, with its sample counterpart, $\bar{Y}$. This is where the name “method of moments” comes from.

We cover two additional method of moments estimators that will be useful for our discussion
of regression analysis. Recall that the covariance between two random variables $X$ and $Y$ is
defined as $\sigma_{XY} = E[(X - \mu_X)(Y - \mu_Y)]$. The method of moments suggests estimating $\sigma_{XY}$ by
$n^{-1}\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})$. This is a consistent estimator of $\sigma_{XY}$, but it turns out to be biased for
essentially the same reason that the sample variance is biased if $n$, rather than $n - 1$, is used as the
divisor. The sample covariance is defined as


$$S_{XY} = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y}).$$ [C.14]


It can be shown that this is an unbiased estimator of $\sigma_{XY}$. (Replacing $n$ with $n - 1$ makes no difference
as the sample size grows indefinitely, so this estimator is still consistent.)

As we discussed in Section B-4, the covariance between two variables is often difficult to
interpret. Usually, we are more interested in correlation. Because the population correlation is
$\rho_{XY} = \sigma_{XY}/(\sigma_X\sigma_Y)$, the method of moments suggests estimating $\rho_{XY}$ as


$$R_{XY} = \frac{S_{XY}}{S_X S_Y} = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\left(\sum_{i=1}^{n}(X_i - \bar{X})^2\right)^{1/2}\left(\sum_{i=1}^{n}(Y_i - \bar{Y})^2\right)^{1/2}},$$ [C.15]


which is called the sample correlation coefficient (or sample correlation for short). Notice that we
have canceled the division by $n - 1$ in the sample covariance and the sample standard deviations. In
fact, we could divide each of these by $n$, and we would arrive at the same final formula.

It can be shown that the sample correlation coefficient is always in the interval
$[-1, 1]$, as it should be. Because $S_{XY}$, $S_X$, and $S_Y$ are consistent for the corresponding population
parameter, $R_{XY}$ is a consistent estimator of the population correlation, $\rho_{XY}$. However, $R_{XY}$ is a biased
estimator for two reasons. First, $S_X$ and $S_Y$ are biased estimators of $\sigma_X$ and $\sigma_Y$, respectively. Second,
$R_{XY}$ is a ratio of estimators, so it would not be unbiased, even if $S_X$ and $S_Y$ were. For our purposes,
this is not important, although the fact that no unbiased estimator of $\rho_{XY}$ exists is a classical result in
mathematical statistics.
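
The sample covariance in (C.14) and the sample correlation in (C.15) are straightforward to compute directly. The following is a minimal sketch; the function names and the five data points are hypothetical and chosen only for illustration.

```python
import math


def sample_covariance(x, y):
    """Sample covariance S_XY with divisor n - 1, as in (C.14)."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    return sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (n - 1)


def sample_correlation(x, y):
    """Sample correlation R_XY = S_XY / (S_X * S_Y), as in (C.15)."""
    s_x = math.sqrt(sample_covariance(x, x))  # sample standard deviation of x
    s_y = math.sqrt(sample_covariance(y, y))  # sample standard deviation of y
    return sample_covariance(x, y) / (s_x * s_y)


# Hypothetical data, chosen only for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 1.0, 4.0, 3.0, 5.0]
print(sample_covariance(x, y))   # 8/4 = 2.0
print(sample_correlation(x, y))  # 2.0/2.5 = 0.8, up to floating-point rounding
```

Because the divisor $n - 1$ appears in both the numerator and the denominator of the ratio, using $n$ instead would leave the correlation unchanged, as the text notes.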


C-4b Maximum Likelihood 


Another general approach to estimation is the method of maximum likelihood, a topic covered in
many introductory statistics courses. A brief summary in the simplest case will suffice here. Let
$\{Y_1, Y_2, \ldots, Y_n\}$ be a random sample from the population distribution $f(y; \theta)$. Because of the random
sampling assumption, the joint distribution of $\{Y_1, Y_2, \ldots, Y_n\}$ is simply the product of the densities:
$f(y_1; \theta)f(y_2; \theta) \cdots f(y_n; \theta)$. In the discrete case, this is $P(Y_1 = y_1, Y_2 = y_2, \ldots, Y_n = y_n)$. Now,
define the likelihood function as


$$L(\theta; Y_1, Y_2, \ldots, Y_n) = f(Y_1; \theta)f(Y_2; \theta) \cdots f(Y_n; \theta),$$


which is a random variable because it depends on the outcome of the random sample $\{Y_1, Y_2, \ldots, Y_n\}$.
The maximum likelihood estimator of $\theta$, call it $W$, is the value of $\theta$ that maximizes the likelihood
function. (This is why we write $L$ as a function of $\theta$, followed by the random sample.) Clearly, this
value depends on the random sample. The maximum likelihood principle says that, out of all the
possible values for $\theta$, the value that makes the likelihood of the observed data largest should be chosen.
Intuitively, this is a reasonable approach to estimating $\theta$.

Usually, it is more convenient to work with the log-likelihood function, which is obtained by
taking the natural log of the likelihood function:


$$\mathcal{L}(\theta) = \log[L(\theta; Y_1, Y_2, \ldots, Y_n)] = \sum_{i=1}^{n}\log[f(Y_i; \theta)] = \sum_{i=1}^{n}\ell(\theta; Y_i),$$ [C.16]

where we use the fact that the log of the product is the sum of the logs. The function
$\ell(\theta; Y_i) = \log[f(Y_i; \theta)]$ is the log-likelihood function for random draw $i$. Because (C.16) is the sum
of independent, identically distributed random variables, analyzing estimators that come from (C.16)
is relatively easy.
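
To make (C.16) concrete, the sketch below evaluates the log-likelihood for a Bernoulli($\theta$) population and maximizes it over a grid. The Bernoulli choice, the grid search, the function names, and the ten-observation sample are assumptions made for illustration; in practice a numerical optimizer or a closed-form solution would be used.

```python
import math


def bernoulli_log_likelihood(theta, sample):
    """Log-likelihood (C.16) for a Bernoulli(theta) sample: the sum over i of
    log f(Y_i; theta), where f(y; theta) = theta^y * (1 - theta)^(1 - y)."""
    return sum(
        math.log(theta) if y == 1 else math.log(1.0 - theta)
        for y in sample
    )


def maximize_on_grid(sample, grid_size=999):
    """Crude maximizer: evaluate the log-likelihood on an evenly spaced grid
    in (0, 1) and return the grid point with the largest value."""
    grid = [(k + 1) / (grid_size + 1) for k in range(grid_size)]
    return max(grid, key=lambda theta: bernoulli_log_likelihood(theta, sample))


# Hypothetical sample: 7 ones and 3 zeros.
sample = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]
theta_hat = maximize_on_grid(sample)
print(theta_hat, sum(sample) / len(sample))
```

For this sample the grid maximizer coincides with the sample proportion of ones, which is the known closed-form maximum likelihood estimator in the Bernoulli case.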

Maximum likelihood estimation (MLE) is usually consistent and sometimes unbiased. But so are
many other estimators. The widespread appeal of MLE is that it is generally the most asymptotically
efficient estimator when the population model $f(y; \theta)$ is correctly specified. In addition, the MLE is
sometimes the minimum variance unbiased estimator; that is, it has the smallest variance among
all unbiased estimators of $\theta$. [See Larsen and Marx (1986, Chapter 5) for verification of these claims.]

In Chapter 17, we will need maximum likelihood to estimate the parameters of more advanced
econometric models. In econometrics, we are almost always interested in the distribution of $Y$
conditional on a set of explanatory variables, say, $X_1, X_2, \ldots, X_k$. Then, we replace the density in (C.16) with
$f(Y_i \mid X_{i1}, \ldots, X_{ik}; \theta_1, \ldots, \theta_p)$, where this density is allowed to depend on $p$ parameters, $\theta_1, \ldots, \theta_p$.
Fortunately, for successful application of maximum likelihood methods, we do not need to delve much
into the computational issues or the large-sample statistical theory. Wooldridge (2010, Chapter 13)
covers the theory of MLE.


C-4c Least Squares 


A third kind of estimator, and one that plays a major role throughout the text, is called a least squares
estimator. We have already seen an example of least squares: the sample mean, $\bar{Y}$, is a least squares
estimator of the population mean, $\mu$. We already know $\bar{Y}$ is a method of moments estimator. What
makes it a least squares estimator? It can be shown that the value of $m$ that makes the sum of squared
deviations


$$\sum_{i=1}^{n}(Y_i - m)^2$$


as small as possible is $m = \bar{Y}$. Showing this is not difficult, but we omit the algebra.
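
For readers who want the omitted step, one standard argument (a sketch using elementary calculus, not taken from the text) differentiates the sum of squared deviations with respect to $m$ and sets the derivative to zero:

```latex
\frac{d}{dm}\sum_{i=1}^{n}(Y_i - m)^2 = -2\sum_{i=1}^{n}(Y_i - m) = 0
\quad\Longrightarrow\quad
\sum_{i=1}^{n} Y_i = n\,m
\quad\Longrightarrow\quad
m = \frac{1}{n}\sum_{i=1}^{n} Y_i = \bar{Y}.
```

The second derivative is $2n > 0$, so the stationary point is a minimum.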

For some important distributions, including the normal and the Bernoulli, the sample average
$\bar{Y}$ is also the maximum likelihood estimator of the population mean $\mu$. Thus, the principles of least
squares, method of moments, and maximum likelihood often result in the same estimator. In other
cases, the estimators are similar but not identical.


C-5 Interval Estimation and Confidence Intervals


C-5a The Nature of Interval Estimation 


A point estimate obtained from a particular sample does not, by itself, provide enough information for 
testing economic theories or for informing policy discussions. A point estimate may be the research- 
er’s best guess at the population value, but, by its nature, it provides no information about how close 
the estimate is “likely” to be to the population parameter. As an example, suppose a researcher reports, 
on the basis of a random sample of workers, that job training grants increase hourly wage by 6.4%. 
How are we to know whether or not this is close to the effect in the population of workers who could 
have been trained? Because we do not know the population value, we cannot know how close an esti- 
mate is for a particular sample. However, we can make statements involving probabilities, and this is 
where interval estimation comes in. 

We already know one way of assessing the uncertainty in an estimator: find its sampling standard 
deviation. Reporting the standard deviation of the estimator, along with the point estimate, provides 
some information on the accuracy of our estimate. However, even if the problem of the standard devi- 
ation’s dependence on unknown population parameters is ignored, reporting the standard deviation 
along with the point estimate makes no direct statement about where the population value is likely 
to lie in relation to the estimate. This limitation is overcome by constructing a confidence interval. 

We illustrate the concept of a confidence interval with an example. Suppose the population has a
Normal($\mu$, 1) distribution and let $\{Y_1, \ldots, Y_n\}$ be a random sample from this population. (We assume
that the variance of the population is known and equal to unity for the sake of illustration; we then
show what to do in the more realistic case that the variance is unknown.) The sample average, $\bar{Y}$, has a
normal distribution with mean $\mu$ and variance $1/n$: $\bar{Y} \sim \text{Normal}(\mu, 1/n)$. From this, we can
standardize $\bar{Y}$, and, because the standardized version of $\bar{Y}$ has a standard normal distribution, we have


$$P\left(-1.96 < \frac{\bar{Y} - \mu}{1/\sqrt{n}} < 1.96\right) = .95.$$


The event in parentheses is identical to the event $\bar{Y} - 1.96/\sqrt{n} < \mu < \bar{Y} + 1.96/\sqrt{n}$, so


$$P(\bar{Y} - 1.96/\sqrt{n} < \mu < \bar{Y} + 1.96/\sqrt{n}) = .95.$$ [C.17]


Equation (C.17) is interesting because it tells us that the probability that the random interval
$[\bar{Y} - 1.96/\sqrt{n},\ \bar{Y} + 1.96/\sqrt{n}]$ contains the population mean $\mu$ is .95, or 95%. This information
allows us to construct an interval estimate of $\mu$, which is obtained by plugging in the sample outcome
of the average, $\bar{y}$. Thus,


$$[\bar{y} - 1.96/\sqrt{n},\ \bar{y} + 1.96/\sqrt{n}]$$ [C.18]


is an example of an interval estimate of $\mu$. It is also called a 95% confidence interval. A shorthand
notation for this interval is $\bar{y} \pm 1.96/\sqrt{n}$.

The confidence interval in equation (C.18) is easy to compute, once the sample data
$\{y_1, y_2, \ldots, y_n\}$ are observed; $\bar{y}$ is the only factor that depends on the data. For example, suppose
that $n = 16$ and the average of the 16 data points is 7.3. Then, the 95% confidence interval for $\mu$ is
$7.3 \pm 1.96/\sqrt{16} = 7.3 \pm .49$, which we can write in interval form as $[6.81, 7.79]$. By construction,
$\bar{y} = 7.3$ is in the center of this interval.
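
The arithmetic in this example can be checked with a few lines of code. This is a minimal sketch; the function name `known_variance_ci` and its `sigma` and `z` parameters are introduced here for illustration, with defaults matching the unit-variance, 95% case in (C.18).

```python
import math


def known_variance_ci(ybar, n, sigma=1.0, z=1.96):
    """Interval ybar +/- z * sigma / sqrt(n); with sigma = 1 and z = 1.96
    this is the 95% interval in (C.18)."""
    half_width = z * sigma / math.sqrt(n)
    return ybar - half_width, ybar + half_width


lower, upper = known_variance_ci(ybar=7.3, n=16)
print(round(lower, 2), round(upper, 2))  # 6.81 7.79
```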

The meaning of a confidence interval is more difficult to understand than its computation.
When we say that equation (C.18) is a 95% confidence interval for $\mu$, we mean that the random
interval


$$[\bar{Y} - 1.96/\sqrt{n},\ \bar{Y} + 1.96/\sqrt{n}]$$ [C.19]

contains $\mu$ with probability .95. In other words, before the random sample is drawn, there is a 95%
chance that (C.19) contains $\mu$. Equation (C.19) is an example of an interval estimator. It is a random
interval, because the endpoints change with different samples.

A confidence interval is often interpreted as follows: “The probability that $\mu$ is in the interval
(C.18) is .95.” This is incorrect. Once the sample has been observed and $\bar{y}$ has been computed, the
limits of the confidence interval are simply numbers (6.81 and 7.79 in the example just given). The
population parameter, $\mu$, though unknown, is also just some number. Therefore, $\mu$ either is or is not
in the interval (C.18) (and we will never know with certainty which is the case). Probability plays no
role once the confidence interval is computed for the particular data at hand. The probabilistic
interpretation comes from the fact that for 95% of all random samples, the constructed confidence interval
will contain $\mu$.

To emphasize the meaning of a confidence interval, Table C.2 contains calculations for 20 random
samples (or replications) from the Normal(2, 1) distribution with sample size $n = 10$. For each
of the 20 samples, $\bar{y}$ is obtained, and (C.18) is computed as $\bar{y} \pm 1.96/\sqrt{10} = \bar{y} \pm .62$ (each rounded
to two decimals). As you can see, the interval changes with each random sample. Nineteen of the
20 intervals contain the population value of $\mu$. Only for replication number 19 is $\mu$ not in the
confidence interval. In other words, 95% of the samples result in a confidence interval that contains $\mu$. This
did not have to be the case with only 20 replications, but it worked out that way for this particular
simulation.
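
An experiment of this kind is easy to repeat. The sketch below is not the code used to produce Table C.2: the function name, the seed, and the random number generator are assumptions, so the individual intervals will differ from those in the table. Only the long-run coverage rate is expected to agree.

```python
import math
import random


def coverage_experiment(mu=2.0, n=10, reps=20, seed=0):
    """Draw `reps` samples of size n from Normal(mu, 1), build the interval
    ybar +/- 1.96/sqrt(n) for each, and return the fraction containing mu."""
    rng = random.Random(seed)
    half_width = 1.96 / math.sqrt(n)
    hits = 0
    for _ in range(reps):
        ybar = sum(rng.gauss(mu, 1.0) for _ in range(n)) / n
        if ybar - half_width <= mu <= ybar + half_width:
            hits += 1
    return hits / reps


print(coverage_experiment(reps=20))     # varies with the seed
print(coverage_experiment(reps=20000))  # should be close to .95
```

With only 20 replications the observed fraction can easily differ from .95; with many replications it settles near .95, which is the sense in which the interval has 95% confidence.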


TABLE C.2 Simulated Confidence Intervals from a Normal($\mu$, 1) Distribution with $\mu = 2$

Replication   $\bar{y}$   95% Interval    Contains $\mu$?
1             1.98        (1.36, 2.60)    Yes
2             1.43        (0.81, 2.05)    Yes
3             1.65        (1.03, 2.27)    Yes
4             1.88        (1.26, 2.50)    Yes
5             2.34        (1.72, 2.96)    Yes
6             2.58        (1.96, 3.20)    Yes
7             1.58        (.96, 2.20)     Yes
8             2.23        (1.61, 2.85)    Yes
9             1.96        (1.34, 2.58)    Yes
10            2.11        (1.49, 2.73)    Yes
11            2.15        (1.53, 2.77)    Yes
12            1.93        (1.31, 2.55)    Yes
13            2.02        (1.40, 2.64)    Yes
14            2.10        (1.48, 2.72)    Yes
15            2.18        (1.56, 2.80)    Yes
16            2.10        (1.48, 2.72)    Yes
17            1.94        (1.32, 2.56)    Yes
18            2.21        (1.59, 2.83)    Yes
19            1.16        (.54, 1.78)     No
20            1.75        (1.13, 2.37)    Yes


C-5b Confidence Intervals for the Mean from a Normally Distributed Population


The confidence interval derived in equation (C.18) helps illustrate how to construct and interpret
confidence intervals. In practice, equation (C.18) is not very useful for the mean of a normal population
because it assumes that the variance is known to be unity. It is easy to extend (C.18) to the case where
the standard deviation $\sigma$ is known to be any value: the 95% confidence interval is


$$[\bar{y} - 1.96\sigma/\sqrt{n},\ \bar{y} + 1.96\sigma/\sqrt{n}].$$ [C.20]


Therefore, provided $\sigma$ is known, a confidence interval for $\mu$ is readily constructed. To allow for
unknown $\sigma$, we must use an estimate. Let


$$s = \left(\frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2\right)^{1/2}$$ [C.21]


denote the sample standard deviation. Then, we obtain a confidence interval that depends entirely on
the observed data by replacing $\sigma$ in equation (C.20) with its estimate, $s$. Unfortunately, this does not
preserve the 95% level of confidence because $s$ depends on the particular sample. In other words, the
random interval $[\bar{Y} \pm 1.96(S/\sqrt{n})]$ no longer contains $\mu$ with probability .95 because the constant $\sigma$
has been replaced with the random variable $S$.

How should we proceed? Rather than using the standard normal distribution, we must rely on the
$t$ distribution. The $t$ distribution arises from the fact that


$$\frac{\bar{Y} - \mu}{S/\sqrt{n}} \sim t_{n-1},$$ [C.22]

where $\bar{Y}$ is the sample average and $S$ is the sample standard deviation of the random sample
$\{Y_1, \ldots, Y_n\}$. We will not prove (C.22); a careful proof can be found in a variety of places [for
example, Larsen and Marx (1986, Chapter 7)].

To construct a 95% confidence interval, let $c$ denote the 97.5th percentile in the $t_{n-1}$ distribution.
In other words, $c$ is the value such that 95% of the area in the $t_{n-1}$ is between $-c$ and $c$:
$P(-c < t_{n-1} < c) = .95$. (The value of $c$ depends on the degrees of freedom $n - 1$, but we do not
make this explicit.) The choice of $c$ is illustrated in Figure C.4. Once $c$ has been properly chosen, the
random interval $[\bar{Y} - c \cdot S/\sqrt{n},\ \bar{Y} + c \cdot S/\sqrt{n}]$ contains $\mu$ with probability .95. For a particular sample,
the 95% confidence interval is calculated as


$$[\bar{y} - c \cdot s/\sqrt{n},\ \bar{y} + c \cdot s/\sqrt{n}].$$ [C.23]


FIGURE C.4 The 97.5th percentile, $c$, in a $t$ distribution.
[Figure: $t$ density with area = .95 between $-c$ and $c$ and area = .025 in each tail.]


The values of $c$ for various degrees of freedom can be obtained from Table G.2 in Statistical Tables.
For example, if $n = 20$, so that the df is $n - 1 = 19$, then $c = 2.093$. Thus, the 95% confidence
interval is $[\bar{y} \pm 2.093(s/\sqrt{20})]$, where $\bar{y}$ and $s$ are the values obtained from the sample. Even if
$s = \sigma$ (which is very unlikely), the confidence interval in (C.23) is wider than that in (C.20) because
$c > 1.96$. For small degrees of freedom, (C.23) is much wider.

More generally, let $c_{\alpha}$ denote the $100(1 - \alpha)$ percentile in the $t_{n-1}$ distribution. Then, a
$100(1 - \alpha)\%$ confidence interval is obtained as


$$[\bar{y} - c_{\alpha/2}\, s/\sqrt{n},\ \bar{y} + c_{\alpha/2}\, s/\sqrt{n}].$$ [C.24]


Obtaining $c_{\alpha/2}$ requires choosing $\alpha$ and knowing the degrees of freedom $n - 1$; then, Table G.2 can be
used. For the most part, we will concentrate on 95% confidence intervals.

There is a simple way to remember how to construct a confidence interval for the mean of a normal
distribution. Recall that $\text{sd}(\bar{Y}) = \sigma/\sqrt{n}$. Thus, $s/\sqrt{n}$ is the point estimate of $\text{sd}(\bar{Y})$. The associated
random variable, $S/\sqrt{n}$, is sometimes called the standard error of $\bar{Y}$. Because what shows up in
formulas is the point estimate $s/\sqrt{n}$, we define the standard error of $\bar{y}$ as $\text{se}(\bar{y}) = s/\sqrt{n}$. Then, (C.24)
can be written in shorthand as


$$[\bar{y} \pm c_{\alpha/2} \cdot \text{se}(\bar{y})].$$ [C.25]


This equation shows why the notion of the standard error of an estimate plays an important role in 
econometrics. 


EXAMPLE C.2 Effect of Job Training Grants on Worker Productivity


Holzer, Block, Cheatham, and Knott (1993) studied the effects of job training grants on worker pro- 
ductivity by collecting information on “scrap rates” for a sample of Michigan manufacturing firms 
receiving job training grants in 1988. Table C.3 lists the scrap rates—measured as number of items 
per 100 produced that are not usable and therefore need to be scrapped—for 20 firms. Each of these 
firms received a job training grant in 1988; there were no grants awarded in 1987. We are interested in 
constructing a confidence interval for the change in the scrap rate from 1987 to 1988 for the popula- 
tion of all manufacturing firms that could have received grants. 

We assume that the change in scrap rates has a normal distribution. Because $n = 20$, a 95%
confidence interval for the mean change in scrap rates $\mu$ is $[\bar{y} \pm 2.093 \cdot \text{se}(\bar{y})]$, where $\text{se}(\bar{y}) = s/\sqrt{n}$. The
value 2.093 is the 97.5th percentile in a $t_{19}$ distribution. For the particular sample values, $\bar{y} = -1.15$
and $\text{se}(\bar{y}) = .54$ (each rounded to two decimals), so the 95% confidence interval is $[-2.28, -.02]$.
The value zero is excluded from this interval, so we conclude that, with 95% confidence, the average
change in scrap rates in the population is not zero.


TABLE C.3 Scrap Rates for 20 Michigan Manufacturing Firms

Firm      1987    1988    Change
1         10      3       -7
2         1       1       0
3         6       5       -1
4         .45     .5      .05
5         1.25    1.54    .29
6         1.3     1.5     .2
7         1.06    .8      -.26
8         3       2       -1
9         8.18    .67     -7.51
10        1.67    1.17    -.5
11        .98     .51     -.47
12        1       .5      -.5
13        .45     .61     .16
14        5.03    6.7     1.67
15        8       4       -4
16        9       7       -2
17        18      19      1
18        .28     .2      -.08
19        7       5       -2
20        3.97    3.83    -.14
Average   4.38    3.23    -1.15
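
The calculation in Example C.2 can be sketched from the Change column of Table C.3 (each change taken as the 1988 rate minus the 1987 rate). The critical value 2.093 is the one quoted in the text; the variable names are introduced here for illustration. Because the interval reported in the text is built from $\bar{y}$ and $\text{se}(\bar{y})$ after rounding each to two decimals, endpoints computed from the unrounded values can differ slightly in the second decimal.

```python
import math
import statistics

# Change in scrap rate (1988 minus 1987) for the 20 firms in Table C.3.
changes = [-7, 0, -1, 0.05, 0.29, 0.2, -0.26, -1, -7.51, -0.5,
           -0.47, -0.5, 0.16, 1.67, -4, -2, 1, -0.08, -2, -0.14]

n = len(changes)
ybar = statistics.mean(changes)
s = statistics.stdev(changes)  # sample standard deviation, divisor n - 1, as in (C.21)
se = s / math.sqrt(n)          # standard error of ybar

c = 2.093                      # 97.5th percentile of t with 19 df, as quoted in the text
lower, upper = ybar - c * se, ybar + c * se

print(round(ybar, 2), round(se, 2))  # -1.15 0.54
print(lower, upper)                  # both endpoints are negative
```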


At this point, Example C.2 is mostly illustrative because it has some potentially serious flaws as 
an econometric analysis. Most importantly, it assumes that any systematic reduction in scrap rates 
is due to the job training grants. But many things can happen over the course of the year to change 
worker productivity. From this analysis, we have no way of knowing whether the fall in average scrap 
rates is attributable to the job training grants or if, at least partly, some external force is responsible. 


C-5c A Simple Rule of Thumb for a 95% Confidence Interval 


The confidence interval in (C.25) can be computed for any sample size and any confidence level. As
we saw in Section B-5, the $t$ distribution approaches the standard normal distribution as the degrees of
freedom gets large. In particular, for $\alpha = .05$, $c_{\alpha/2} \to 1.96$ as $n \to \infty$, although $c_{\alpha/2}$ is always greater than
1.96 for each $n$. A rule of thumb for an approximate 95% confidence interval is


$$[\bar{y} \pm 2 \cdot \text{se}(\bar{y})].$$ [C.26]


In other words, we obtain $\bar{y}$ and its standard error and then compute $\bar{y}$ plus or minus twice its
standard error to obtain the confidence interval. This is slightly too wide for very large $n$, and it is too
narrow for small $n$. As we can see from Example C.2, even for $n$ as small as 20, (C.26) is in the
ballpark for a 95% confidence interval for the mean from a normal distribution. This means we can get
pretty close to a 95% confidence interval without having to refer to $t$ tables.


C-5d Asymptotic Confidence Intervals for Nonnormal Populations


In some applications, the population is clearly nonnormal. A leading case is the Bernoulli distribution, 
where the random variable takes on only the values zero and one. In other cases, the nonnormal popu- 
lation has no standard distribution. This does not matter, provided the sample size is sufficiently large 
for the central limit theorem to give a good approximation for the distribution of the sample average $\bar{Y}$.
For large $n$, an approximate 95% confidence interval is


$$[\bar{y} \pm 1.96 \cdot \text{se}(\bar{y})],$$ [C.27]


where the value 1.96 is the 97.5th percentile in the standard normal distribution. Mechanically,
computing an approximate confidence interval does not differ from the normal case. A slight difference
is that the number multiplying the standard error comes from the standard normal distribution, rather
than the $t$ distribution, because we are using asymptotics. Because the $t$ distribution approaches the
standard normal as the df increases, equation (C.25) is also perfectly legitimate as an approximate
95% interval; some prefer this to (C.27) because the former is exact for normal populations.


EXAMPLE C.3 Race Discrimination in Hiring


The Urban Institute conducted a study in 1988 in Washington, D.C., to examine the extent of race 
discrimination in hiring. Five pairs of people interviewed for several jobs. In each pair, one person 
was black and the other person was white. They were given résumés indicating that they were virtu- 
ally the same in terms of experience, education, and other factors that determine job qualification. 
The idea was to make individuals as similar as possible with the exception of race. Each person in a 
pair interviewed for the same job, and the researchers recorded which applicant received a job offer. 
This is an example of a matched pairs analysis, where each trial consists of data on two people (or 
two firms, two cities, and so on) that are thought to be similar in many respects but different in one 
important characteristic. 

Let $\theta_B$ denote the probability that the black person is offered a job and let $\theta_W$ be the probability
that the white person is offered a job. We are primarily interested in the difference, $\theta_B - \theta_W$. Let $B_i$
denote a Bernoulli variable equal to one if the black person gets a job offer from employer $i$, and zero
otherwise. Similarly, $W_i = 1$ if the white person gets a job offer from employer $i$, and zero otherwise.
Pooling across the five pairs of people, there were a total of $n = 241$ trials (pairs of interviews with
employers). Unbiased estimators of $\theta_B$ and $\theta_W$ are $\bar{B}$ and $\bar{W}$, the fractions of interviews for which
blacks and whites were offered jobs, respectively.

To put this into the framework of computing a confidence interval for a population mean, define
a new variable $Y_i = B_i - W_i$. Now, $Y_i$ can take on three values: $-1$ if the black person did not get the
job but the white person did, 0 if both people either did or did not get the job, and 1 if the black person
got the job and the white person did not. Then, $\mu = E(Y_i) = E(B_i) - E(W_i) = \theta_B - \theta_W$.

The distribution of $Y_i$ is certainly not normal: it is discrete and takes on only three values.
Nevertheless, an approximate confidence interval for $\theta_B - \theta_W$ can be obtained by using large sample
methods.

The data from the Urban Institute audit study are in the file AUDIT. Using the 241 observed
data points, $\bar{b} = .224$ and $\bar{w} = .357$, so $\bar{y} = .224 - .357 = -.133$. Thus, 22.4% of black applicants
were offered jobs, while 35.7% of white applicants were offered jobs. This is prima facie evidence of
discrimination against blacks, but we can learn much more by computing a confidence interval for $\mu$.
To compute an approximate 95% confidence interval, we need the sample standard deviation. This
turns out to be $s = .482$ [using equation (C.21)]. Using (C.27), we obtain a 95% CI for $\mu = \theta_B - \theta_W$
as $-.133 \pm 1.96(.482/\sqrt{241}) = -.133 \pm .061 = [-.194, -.072]$. The approximate 99% CI is
$-.133 \pm 2.58(.482/\sqrt{241}) = [-.213, -.053]$. Naturally, this contains a wider range of values than
the 95% CI. But even the 99% CI does not contain the value zero. Thus, we are very confident that the
population difference $\theta_B - \theta_W$ is not zero.
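
Evaluating (C.27) with the reported summary statistics is a one-line calculation: the standard error is $.482/\sqrt{241}$, about .031, so the 95% half-width is $1.96$ times that, about .061. The sketch below uses only the summary numbers quoted in the text (it does not read the AUDIT file), and the function name is an assumption made for illustration.

```python
import math


def asymptotic_ci(ybar, s, n, z):
    """Approximate interval ybar +/- z * s / sqrt(n), as in (C.27)."""
    half_width = z * s / math.sqrt(n)
    return ybar - half_width, ybar + half_width


# Summary statistics reported in the text for the audit data.
ybar, s, n = -0.133, 0.482, 241

ci95 = asymptotic_ci(ybar, s, n, z=1.96)
ci99 = asymptotic_ci(ybar, s, n, z=2.58)
print([round(v, 3) for v in ci95])  # [-0.194, -0.072]
print([round(v, 3) for v in ci99])  # [-0.213, -0.053]
```

Both intervals lie entirely below zero, which is the basis for the conclusion that the population difference is not zero.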

Before we turn to hypothesis testing, it is useful to review the various population and sample
quantities that measure the spreads in the population distributions and the sampling distributions of
the estimators. These quantities appear often in statistical analysis, and extensions of them are
important for the regression analysis in the main text. The quantity $\sigma$ is the (unknown) population standard
deviation; it is a measure of the spread in the distribution of $Y$. When we divide $\sigma$ by $\sqrt{n}$, we obtain
the sampling standard deviation of $\bar{Y}$ (the sample average). While $\sigma$ is a fixed feature of the
population, $\text{sd}(\bar{Y}) = \sigma/\sqrt{n}$ shrinks to zero as $n \to \infty$: our estimator of $\mu$ gets more and more precise as the
sample size grows.

The estimate of $\sigma$ for a particular sample, $s$, is called the sample standard deviation because it
is obtained from the sample. (We also call the underlying random variable, $S$, which changes across
different samples, the sample standard deviation.) Like $\bar{y}$ as an estimate of $\mu$, $s$ is our “best guess”
at $\sigma$ given the sample at hand. The quantity $s/\sqrt{n}$ is what we call the standard error of $\bar{y}$, and it is
our best estimate of $\sigma/\sqrt{n}$. Confidence intervals for the population parameter $\mu$ depend directly on
$\text{se}(\bar{y}) = s/\sqrt{n}$. Because this standard error shrinks to zero as the sample size grows, a larger sample
size generally means a smaller confidence interval. Thus, we see clearly that one benefit of more data
is that they result in narrower confidence intervals. The notion of the standard error of an estimate,
which in the vast majority of cases shrinks to zero at the rate $1/\sqrt{n}$, plays a fundamental role in
hypothesis testing (as we will see in the next section) and for confidence intervals and testing in the
context of multiple regression (as discussed in Chapter 4).


C-6 Hypothesis Testing 


So far, we have reviewed how to evaluate point estimators, and we have seen—in the case of a popu- 
lation mean—how to construct and interpret confidence intervals. But sometimes the question we are 
interested in has a definite yes or no answer. Here are some examples: (1) Does a job training program 
effectively increase average worker productivity? (see Example C.2); (2) Are blacks discriminated 
against in hiring? (see Example C.3); (3) Do stiffer state drunk driving laws reduce the number of 
drunk driving arrests? Devising methods for answering such questions, using a sample of data, is 
known as hypothesis testing. 


C-6a Fundamentals of Hypothesis Testing 


To illustrate the issues involved with hypothesis testing, consider an election example. Suppose there 
are two candidates in an election, Candidates A and B. Candidate A is reported to have received 42% 
of the popular vote, while Candidate B received 58%. These are supposed to represent the true per- 
centages in the voting population, and we treat them as such. 

Candidate A is convinced that more people must have voted for him, so he would like to investi- 
gate whether the election was rigged. Knowing something about statistics, Candidate A hires a con- 
sulting agency to randomly sample 100 voters to record whether or not each person voted for him. 
Suppose that, for the sample collected, 53 people voted for Candidate A. This sample estimate of 53% 
clearly exceeds the reported population value of 42%. Should Candidate A conclude that the election 
was indeed a fraud? 

While it appears that the votes for Candidate A were undercounted, we cannot be certain. Even if 
only 42% of the population voted for Candidate A, it is possible that, in a sample of 100, we observe 
53 people who did vote for Candidate A. The question is: How strong is the sample evidence against 
the officially reported percentage of 42%? 

One way to proceed is to set up a hypothesis test. Let $\theta$ denote the true proportion of the
population voting for Candidate A. The hypothesis that the reported results are accurate can be stated as


$$H_0\colon \theta = .42.$$ [C.28]


This is an example of a null hypothesis. We always denote the null hypothesis by $H_0$. In hypothesis
testing, the null hypothesis plays a role similar to that of a defendant on trial in many judicial systems:
just as a defendant is presumed to be innocent until proven guilty, the null hypothesis is presumed to
be true until the data strongly suggest otherwise. In the current example, Candidate A must present
fairly strong evidence against (C.28) in order to win a recount.

The alternative hypothesis in the election example is that the true proportion voting for
Candidate A in the election is greater than .42:


$$H_1\colon \theta > .42.$$ [C.29]


In order to conclude that $H_0$ is false and that $H_1$ is true, we must have evidence “beyond
reasonable doubt” against $H_0$. How many votes out of 100 would be needed before we feel the evidence is
strongly against $H_0$? Most would agree that observing 43 votes out of a sample of 100 is not enough
to overturn the original election results; such an outcome is well within the expected sampling
variation. On the other hand, we do not need to observe 100 votes for Candidate A to cast doubt on $H_0$.
Whether 53 out of 100 is enough to reject $H_0$ is much less clear. The answer depends on how we
quantify “beyond reasonable doubt.”

Before we turn to the issue of quantifying uncertainty in hypothesis testing, we should head off
some possible confusion. You may have noticed that the hypotheses in equations (C.28) and (C.29)
do not exhaust all possibilities: it could be that $\theta$ is less than .42. For the application at hand, we are
not particularly interested in that possibility; it has nothing to do with overturning the results of the
election. Therefore, we can just state at the outset that we are ignoring alternatives $\theta$ with $\theta < .42$.
Nevertheless, some authors prefer to state null and alternative hypotheses so that they are exhaustive,
in which case our null hypothesis should be $H_0\colon \theta \le .42$. Stated in this way, the null hypothesis is a
composite null hypothesis because it allows for more than one value under $H_0$. [By contrast,
equation (C.28) is an example of a simple null hypothesis.] For these kinds of examples, it does not
matter whether we state the null as in (C.28) or as a composite null: the most difficult value to reject if
$\theta \le .42$ is $\theta = .42$. (That is, if we reject the value $\theta = .42$, against $\theta > .42$, then logically we must
reject any value less than .42.) Therefore, our testing procedure based on (C.28) leads to the same test
as if $H_0\colon \theta \le .42$. In this text, we always state a null hypothesis as a simple null hypothesis.

In hypothesis testing, we can make two kinds of mistakes. First, we can reject the null hypothesis
when it is in fact true. This is called a Type I error. In the election example, a Type I error occurs if
we reject $H_0$ when the true proportion of people voting for Candidate A is in fact .42. The second kind
of error is failing to reject $H_0$ when it is actually false. This is called a Type II error. In the election
example, a Type II error occurs if $\theta > .42$ but we fail to reject $H_0$.

After we have made the decision of whether or not to reject the null hypothesis, we have either
decided correctly or we have committed an error. We will never know with certainty whether an error
was committed. However, we can compute the probability of making either a Type I or a Type II error.
Hypothesis testing rules are constructed to make the probability of committing a Type I error fairly
small. Generally, we define the significance level (or simply the level) of a test as the probability of a
Type I error; it is typically denoted by α. Symbolically, we have


$$\alpha = P(\text{Reject } H_0 \mid H_0).$$ [C.30]


The right-hand side is read as: “The probability of rejecting H₀ given that H₀ is true.”

Classical hypothesis testing requires that we initially specify a significance level for a test. When
we specify a value for α, we are essentially quantifying our tolerance for a Type I error. Common
values for α are .10, .05, and .01. If α = .05, then the researcher is willing to falsely reject H₀ 5% of the
time, in order to detect deviations from H₀.

Once we have chosen the significance level, we would then like to minimize the probability of a 
Type II error. Alternatively, we would like to maximize the power of a test against all relevant alter- 
natives. The power of a test is just one minus the probability of a Type II error. Mathematically, 


$$\pi(\theta) = P(\text{Reject } H_0 \mid \theta) = 1 - P(\text{Type II} \mid \theta),$$

where θ denotes the actual value of the parameter. Naturally, we would like the power to equal unity
whenever the null hypothesis is false. But this is impossible to achieve while keeping the significance
level small. Instead, we choose our tests to maximize the power for a given significance level.
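
To make the trade-off between significance level and power concrete, the following minimal sketch returns to the election example with a sample of 100 votes and H₀: θ = .42. It assumes a hypothetical rejection rule, "reject H₀ if at least 53 of the 100 votes are for Candidate A"; the cutoff of 53 and the function names are illustrative choices, not part of the text. The exact binomial distribution gives both the significance level and the power at several values of θ.

```python
from math import comb

def binom_upper_tail(n, k, theta):
    """P(X >= k) when X ~ Binomial(n, theta)."""
    return sum(comb(n, j) * theta**j * (1 - theta) ** (n - j) for j in range(k, n + 1))

def power(theta, n=100, cutoff=53):
    """Probability of rejecting H0 when the true proportion is theta.

    The rule "reject H0 if at least `cutoff` of the n votes are for
    Candidate A" is a hypothetical choice made for this illustration.
    """
    return binom_upper_tail(n, cutoff, theta)

alpha = power(0.42)  # significance level: P(Reject H0 | theta = .42)
print(f"significance level: {alpha:.4f}")
for theta in (0.45, 0.50, 0.55, 0.60):
    print(f"theta = {theta:.2f}: power = {power(theta):.3f}")
```

Lowering the cutoff would raise the power at every θ > .42, but only at the cost of a larger significance level, which is why the level is fixed first.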


C-6b Testing Hypotheses about the Mean in a Normal Population 


In order to test a null hypothesis against an alternative, we need to choose a test statistic (or statistic, 
for short) and a critical value. The choices for the statistic and critical value are based on convenience 
and on the desire to maximize power given a significance level for the test. In this subsection, we 
review how to test hypotheses for the mean of a normal population. 

A test statistic, denoted T, is some function of the random sample. When we compute the sta- 
tistic for a particular outcome, we obtain an outcome of the test statistic, which we will denote by t. 

Given a test statistic, we can define a rejection rule that determines when H₀ is rejected in
favor of H₁. In this text, all rejection rules are based on comparing the value of a test statistic, t, to a
critical value, c. The values of t that result in rejection of the null hypothesis are collectively known
as the rejection region. To determine the critical value, we must first decide on a significance level
of the test. Then, given α, the critical value associated with α is determined by the distribution of T,
assuming that H₀ is true. We will write this critical value as c, suppressing the fact that it depends
on α.

Testing hypotheses about the mean μ from a Normal(μ, σ²) population is straightforward. The
null hypothesis is stated as


$$H_0: \mu = \mu_0,$$ [C.31]


where μ₀ is a value that we specify. In the majority of applications, μ₀ = 0, but the general case is no
more difficult.

The rejection rule we choose depends on the nature of the alternative hypothesis. The three alter- 
natives of interest are 


$$H_1: \mu > \mu_0,$$ [C.32]

$$H_1: \mu < \mu_0,$$ [C.33]

and

$$H_1: \mu \neq \mu_0.$$ [C.34]


Equation (C.32) gives a one-sided alternative, as does (C.33). When the alternative hypothesis is
(C.32), the null is effectively H₀: μ ≤ μ₀, because we reject H₀ only when μ > μ₀. This is appropriate
when we are interested in the value of μ only when μ is at least as large as μ₀. Equation (C.34)
is a two-sided alternative. This is appropriate when we are interested in any departure from the null
hypothesis.

Consider first the alternative in (C.32). Intuitively, we should reject H₀ in favor of H₁ when the
value of the sample average, ȳ, is “sufficiently” greater than μ₀. But how should we determine when ȳ
is large enough for H₀ to be rejected at the chosen significance level? This requires knowing the
probability of rejecting the null hypothesis when it is true. Rather than working directly with ȳ, we use its
standardized version, where σ is replaced with the sample standard deviation, s:


$$t = \sqrt{n}(\bar{y} - \mu_0)/s = (\bar{y} - \mu_0)/\text{se}(\bar{y}),$$ [C.35]


where se(ȳ) = s/√n is the standard error of ȳ. Given the sample of data, it is easy to obtain t. We
work with t because, under the null hypothesis, the random variable


$$T = \sqrt{n}(\bar{Y} - \mu_0)/S$$


has a tₙ₋₁ distribution. Now, suppose we have settled on a 5% significance level. Then, the critical
value c is chosen so that P(T > c|H₀) = .05; that is, the probability of a Type I error is 5%. Once we
have found c, the rejection rule is


$$t > c,$$ [C.36]


where c is the 100(1 − α)th percentile in a tₙ₋₁ distribution; as a percent, the significance level is
100·α%. This is an example of a one-tailed test because the rejection region is in one tail of the t
distribution. For a 5% significance level, c is the 95th percentile in the tₙ₋₁ distribution; this is illustrated
in Figure C.5. A different significance level leads to a different critical value.

FIGURE C.5 Rejection region for a 5% significance level test against the one-sided alternative μ > μ₀.

The statistic in equation (C.35) is often called the t statistic for testing H₀: μ = μ₀. The t statistic
measures the distance from ȳ to μ₀ relative to the standard error of ȳ, se(ȳ).


EXAMPLE C.4 Effect of Enterprise Zones on Business Investments


In the population of cities granted enterprise zones in a particular state [see Papke (1994) for Indiana],
let Y denote the percentage change in investment from the year before to the year after a city became
an enterprise zone. Assume that Y has a Normal(μ, σ²) distribution. The null hypothesis that enterprise
zones have no effect on business investment is H₀: μ = 0; the alternative that they have a positive
effect is H₁: μ > 0. (We assume that they do not have a negative effect.) Suppose that we wish to
test H₀ at the 5% level. The test statistic in this case is


$$t = \frac{\bar{y}}{s/\sqrt{n}} = \frac{\bar{y}}{\text{se}(\bar{y})}.$$ [C.37]


Suppose that we have a sample of 36 cities that are granted enterprise zones. Then, the critical value is
c = 1.69 (see Table G.2), and we reject H₀ in favor of H₁ if t > 1.69. Suppose that the sample yields
ȳ = 8.2 and s = 23.9. Then, t = 2.06, and H₀ is therefore rejected at the 5% level. Thus, we conclude
that, at the 5% significance level, enterprise zones have an effect on average investment. The 1% critical
value is 2.44, so H₀ is not rejected at the 1% level. The same caveat holds here as in Example C.2:
we have not controlled for other factors that might affect investment in cities over time, so we cannot
claim that the effect is causal.
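
The arithmetic in the enterprise zone example can be checked with a short sketch. The sample values (average 8.2, standard deviation 23.9, 36 cities) and the critical values 1.69 and 2.44 are taken from the example; the function name `t_statistic` is only an illustrative choice.

```python
from math import sqrt

def t_statistic(ybar, s, n, mu0=0.0):
    """t = (ybar - mu0) / se(ybar), where se(ybar) = s / sqrt(n)."""
    se = s / sqrt(n)
    return (ybar - mu0) / se

t = t_statistic(ybar=8.2, s=23.9, n=36)

# One-sided critical values for 35 degrees of freedom, as quoted in the example.
c_5pct, c_1pct = 1.69, 2.44

print(f"t = {t:.2f}")
print("reject at 5% level:", t > c_5pct)
print("reject at 1% level:", t > c_1pct)
```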


The rejection rule is similar for the one-sided alternative (C.33). A test with a significance level
of 100·α% rejects H₀ against (C.33) whenever


$$t < -c;$$ [C.38]


in other words, we are looking for negative values of the t statistic (which implies ȳ < μ₀) that are
sufficiently far from zero to reject H₀.

For two-sided alternatives, we must be careful to choose the critical value so that the significance
level of the test is still α. If H₁ is given by H₁: μ ≠ μ₀, then we reject H₀ if ȳ is far from μ₀ in
absolute value: a ȳ much larger or much smaller than μ₀ provides evidence against H₀ in favor of H₁.
A 100·α% level test is obtained from the rejection rule


$$|t| > c,$$ [C.39]


where |t| is the absolute value of the t statistic in (C.35). This gives a two-tailed test. We must now
be careful in choosing the critical value: c is the 100(1 − α/2)th percentile in the tₙ₋₁ distribution. For
example, if α = .05, then the critical value is the 97.5th percentile in the tₙ₋₁ distribution. This ensures
that H₀ is rejected only 5% of the time when it is true (see Figure C.6). For example, if n = 22, then
the critical value is c = 2.08, the 97.5th percentile in a t₂₁ distribution (see Table G.2). The absolute
value of the t statistic must exceed 2.08 in order to reject H₀ against H₁ at the 5% level.

FIGURE C.6 Rejection region for a 5% significance level test against the two-sided alternative H₁: μ ≠ μ₀.

It is important to know the proper language of hypothesis testing. Sometimes, the appropriate
phrase “we fail to reject H₀ in favor of H₁ at the 5% significance level” is replaced with “we accept
H₀ at the 5% significance level.” The latter wording is incorrect. With the same set of data, there are
usually many hypotheses that cannot be rejected. In the earlier election example, it would be logically
inconsistent to say that H₀: θ = .42 and H₀: θ = .43 are both “accepted,” because only one of these
can be true. But it is entirely possible that neither of these hypotheses is rejected. For this reason, we
always say “fail to reject H₀” rather than “accept H₀.”


C-6c Asymptotic Tests for Nonnormal Populations 


If the sample size is large enough to invoke the central limit theorem (see Section C-3), the mechanics 
of hypothesis testing for population means are the same whether or not the population distribution is 
normal. The theoretical justification comes from the fact that, under the null hypothesis, 


$$T = \sqrt{n}(\bar{Y} - \mu_0)/S \overset{a}{\sim} \text{Normal}(0,1).$$


Therefore, with large n, we can compare the t statistic in (C.35) with the critical values from a standard
normal distribution. Because the tₙ₋₁ distribution converges to the standard normal distribution
as n gets large, the t and standard normal critical values will be very close for extremely large n.
Because asymptotic theory is based on n increasing without bound, it cannot tell us whether the standard
normal or t critical values are better. For moderate values of n, say, between 30 and 60, it is traditional
to use the t distribution because we know this is correct for normal populations. For n > 120,
the choice between the t and standard normal distributions is largely irrelevant because the critical
values are practically the same.

Because the critical values chosen using either the standard normal or t distribution are only 
approximately valid for nonnormal populations, our chosen significance levels are also only approxi- 
mate; thus, for nonnormal populations, our significance levels are really asymptotic significance lev- 
els. Thus, if we choose a 5% significance level, but our population is nonnormal, then the actual 
significance level will be larger or smaller than 5% (and we cannot know which is the case). When the 
sample size is large, the actual significance level will be very close to 5%. Practically speaking, the 
distinction is not important, so we will now drop the qualifier “asymptotic.” 


EXAMPLE C.5 Race Discrimination in Hiring


In the Urban Institute study of discrimination in hiring (see Example C.3) using the data in AUDIT,
we are primarily interested in testing H₀: μ = 0 against H₁: μ < 0, where μ = θ_B − θ_W is the difference
in probabilities that blacks and whites receive job offers. Recall that μ is the population mean of
the variable Y = B − W, where B and W are binary indicators. Using the n = 241 paired comparisons
in the data file AUDIT, we obtained ȳ = −.133 and se(ȳ) = .482/√241 = .031. The t statistic
for testing H₀: μ = 0 is t = −.133/.031 ≈ −4.29. You will remember from Math Refresher B that
the standard normal distribution is, for practical purposes, indistinguishable from the t distribution
with 240 degrees of freedom. The value −4.29 is so far out in the left tail of the distribution that we
reject H₀ at any reasonable significance level. In fact, the .005 (one-half of a percent) critical value
(for the one-sided test) is about −2.58. A t value of −4.29 is very strong evidence against H₀ in favor
of H₁. Hence, we conclude that there is discrimination in hiring.
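
As a rough check on this example, the sketch below recomputes the standard error and t statistic from the reported values (sample average −.133, standard deviation .482, n = 241) and obtains the .005 one-sided critical value from the standard normal distribution using Python's standard library. Carrying the unrounded standard error through gives a t statistic of about −4.28 rather than −4.29; the conclusion is unaffected.

```python
from math import sqrt
from statistics import NormalDist

ybar, s, n = -0.133, 0.482, 241   # values reported in the example

se = s / sqrt(n)                  # standard error of the sample average
t = ybar / se                     # t statistic for H0: mu = 0

# Large-sample (standard normal) critical value for a one-sided test
# at the .005 level; the test rejects when t < -c.
c = NormalDist().inv_cdf(1 - 0.005)
reject = t < -c

print(f"se = {se:.3f}, t = {t:.2f}, critical value = {-c:.2f}, reject = {reject}")
```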


C-6d Computing and Using p-Values 


The traditional requirement of choosing a significance level ahead of time means that different 
researchers, using the same data and same procedure to test the same hypothesis, could wind up with 
different conclusions. Reporting the significance level at which we are carrying out the test solves this 
problem to some degree, but it does not completely remove the problem.

To provide more information, we can ask the following question: What is the largest significance
level at which we could carry out the test and still fail to reject the null hypothesis? This value is 
known as the p-value of a test (sometimes called the prob-value). Compared with choosing a signifi- 
cance level ahead of time and obtaining a critical value, computing a p-value is somewhat more diffi- 
cult. But with the advent of quick and inexpensive computing, p-values are now fairly easy to obtain. 

As an illustration, consider the problem of testing H₀: μ = 0 in a Normal(μ, σ²) population.
Our test statistic in this case is T = √n·Ȳ/S, and we assume that n is large enough to treat T as having
a standard normal distribution under H₀. Suppose that the observed value of T for our sample is
t = 1.52. (Note how we have skipped the step of choosing a significance level.) Now that we have
seen the value t, we can find the largest significance level at which we would fail to reject H₀. This
is the significance level associated with using t as our critical value. Because our test statistic T has a
standard normal distribution under H₀, we have


$$p\text{-value} = P(T > 1.52 \mid H_0) = 1 - \Phi(1.52) = .065,$$ [C.40]


where Φ(·) denotes the standard normal cdf. In other words, the p-value in this example is simply the
area to the right of 1.52, the observed value of the test statistic, in a standard normal distribution. See
Figure C.7 for illustration.

Because the p-value = .065, the largest significance level at which we can carry out this test and
fail to reject is 6.5%. If we carry out the test at a level below 6.5% (such as at 5%), we fail to reject H₀.
If we carry out the test at a level larger than 6.5% (such as 10%), we reject H₀. With the p-value at
hand, we can carry out the test at any level.

The p-value in this example has another useful interpretation: it is the probability that we observe
a value of T as large as 1.52 when the null hypothesis is true. If the null hypothesis is actually true, we
would observe a value of T as large as 1.52 due to chance only 6.5% of the time. Whether this is small
enough to reject H₀ depends on our tolerance for a Type I error. The p-value has a similar interpretation
in all other cases, as we will see.

FIGURE C.7 The p-value when t = 1.52 for the one-sided alternative μ > μ₀.

Generally, small p-values are evidence against H₀, because they indicate that the outcome of the
data occurs with small probability if H₀ is true. In the previous example, if t had been a larger value,
say, t = 2.85, then the p-value would be 1 − Φ(2.85) = .002. This means that, if the null hypothesis
were true, we would observe a value of T as large as 2.85 with probability .002. How do we
interpret this? Either we obtained a very unusual sample or the null hypothesis is false. Unless we
have a very small tolerance for Type I error, we would reject the null hypothesis. On the other hand, a
large p-value is weak evidence against H₀. If we had gotten t = .47 in the previous example, then the
p-value = 1 − Φ(.47) = .32. Observing a value of T larger than .47 happens with probability .32,
even when H₀ is true; this is large enough so that there is insufficient doubt about H₀, unless we have
a very high tolerance for Type I error.
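
The one-sided p-values discussed above can be reproduced with the standard normal cdf from Python's standard library. This is a minimal sketch; the function name `p_value_right_tail` is an illustrative choice. For t = 1.52 the computed area is roughly .064, close to the .065 reported above.

```python
from statistics import NormalDist

def p_value_right_tail(t):
    """Area to the right of t under the standard normal density: 1 - Phi(t)."""
    return 1.0 - NormalDist().cdf(t)

for t in (1.52, 2.85, 0.47):
    p = p_value_right_tail(t)
    # Reject H0 at level alpha exactly when the p-value is below alpha.
    decisions = {alpha: p < alpha for alpha in (0.10, 0.05, 0.01)}
    print(f"t = {t:.2f}: p-value = {p:.3f}, reject at 10%/5%/1%: {decisions}")
```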

For hypothesis testing about a population mean using the t distribution, we need detailed tables
in order to compute p-values. Table G.2 only allows us to put bounds on p-values. Fortunately, many
statistics and econometrics packages now compute p-values routinely, and they also provide calculation
of cdfs for the t and other distributions used for computing p-values.


EXAMPLE C.6 Effect of Job Training Grants on Worker Productivity


Consider again the Holzer et al. (1993) data in Example C.2. From a policy perspective, there are two
questions of interest. First, what is our best estimate of the mean change in scrap rates, μ? We have
already obtained this for the sample of 20 firms listed in Table C.3: the sample average of the change
in scrap rates is −1.15. Relative to the initial average scrap rate in 1987, this represents a fall in the
scrap rate of about 26.3% (−1.15/4.38 ≈ −.263), which is a nontrivial effect.

We would also like to know whether the sample provides strong evidence for an effect in the
population of manufacturing firms that could have received grants. The null hypothesis is H₀: μ = 0,
and we test this against H₁: μ < 0, where μ is the average change in scrap rates. Under the null, the
job training grants have no effect on average scrap rates. The alternative states that there is an effect.
We do not care about the alternative μ > 0, so the null hypothesis is effectively H₀: μ ≥ 0.

Because ȳ = −1.15 and se(ȳ) = .54, t = −1.15/.54 = −2.13. This is below the 5% critical
value of −1.73 (from a t₁₉ distribution) but above the 1% critical value, −2.54. The p-value in this
case is computed as


$$p\text{-value} = P(T_{19} < -2.13),$$ [C.41]


where T₁₉ represents a t distributed random variable with 19 degrees of freedom. The inequality is
reversed from (C.40) because the alternative has the form in (C.33). The probability in (C.41) is the
area to the left of −2.13 in a t₁₉ distribution (see Figure C.8).

FIGURE C.8 The p-value when t = −2.13 with 19 degrees of freedom for the one-sided alternative.

Using Table G.2, the most we can say is that the p-value is between .025 and .01, but it is closer
to .025 (because the 97.5th percentile is about 2.09). Using a statistical package, such as Stata®, we
can compute the exact p-value. It turns out to be about .023, which is reasonable evidence against H₀.
This is certainly enough evidence to reject the null hypothesis that the training grants had no effect at
the 2.5% significance level (and therefore at the 5% level).
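
When no statistical package is at hand, the p-value in (C.41) can be approximated by integrating the t density numerically. The sketch below is one possible way to do this using only Python's standard library: it applies Simpson's rule to the density of a t distribution with 19 degrees of freedom. The function names and the number of integration steps are illustrative assumptions, and a dedicated statistics library would normally be preferred.

```python
from math import gamma, pi, sqrt

def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    coef = gamma((df + 1) / 2) / (sqrt(df * pi) * gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(x, df, steps=2000):
    """P(T <= x), using Simpson's rule on [0, |x|] and the symmetry of the density."""
    a = abs(x)
    h = a / steps                      # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3               # P(0 <= T <= |x|)
    return 0.5 + area if x >= 0 else 0.5 - area

p_value = t_cdf(-2.13, df=19)          # area to the left of -2.13
print(f"p-value = {p_value:.3f}")
```

The result lies between the bounds .01 and .025 obtained from the table and agrees with the value of about .023 quoted above.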


Computing a p-value for a two-sided test is similar, but we must account for the two-sided nature
of the rejection rule. For t testing about population means, the p-value is computed as


$$P(|T_{n-1}| > |t|) = 2P(T_{n-1} > |t|),$$ [C.42]


where t is the value of the test statistic and Tₙ₋₁ is a t random variable. (For large n, replace Tₙ₋₁ with
a standard normal random variable.) Thus, compute the absolute value of the t statistic, find the area
to the right of this value in a tₙ₋₁ distribution, and multiply the area by two.

For nonnormal populations, the exact p-value can be difficult to obtain. Nevertheless, we can 
find asymptotic p-values by using the same calculations. These p-values are valid for large sample 
sizes. For n larger than, say, 120, we might as well use the standard normal distribution. Table G.1 is 
detailed enough to get accurate p-values, but we can also use a statistics or econometrics program. 


EXAMPLE C.7 Race Discrimination in Hiring


Using the matched pairs data from the Urban Institute in the AUDIT data file (n = 241), we obtained
t = −4.29. If Z is a standard normal random variable, P(Z < −4.29) is, for practical purposes, zero.
In other words, the (asymptotic) p-value for this example is essentially zero. This is very strong
evidence against H₀.


Summary of How to Use p-Values: 


(i) Choose a test statistic T and decide on the nature of the alternative. This determines whether 
the rejection rule is t > c, t < —c, or |t| > c. 

(ii) Use the observed value of the t statistic as the critical value and compute the corresponding
significance level of the test. This is the p-value. If the rejection rule is of the form t > c, then
p-value = P(T > t). If the rejection rule is t < −c, then p-value = P(T < t); if the rejection rule is
|t| > c, then p-value = P(|T| > |t|).

(iii) If a significance level α has been chosen, then we reject H₀ at the 100·α% level if
p-value < α. If p-value ≥ α, then we fail to reject H₀ at the 100·α% level. Therefore, it is a small
p-value that leads to rejection of the null hypothesis.
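
The three steps above can be sketched as a small helper. This version assumes that n is large enough for the test statistic to be treated as standard normal under the null; the function names and the labels "greater", "less", and "two-sided" are illustrative choices rather than notation from the text.

```python
from statistics import NormalDist

def p_value(t, alternative):
    """Large-sample p-value for an observed t statistic.

    alternative: "greater"   -> rejection rule t > c,   p-value = P(T > t)
                 "less"      -> rejection rule t < -c,  p-value = P(T < t)
                 "two-sided" -> rejection rule |t| > c, p-value = P(|T| > |t|)
    """
    z = NormalDist()
    if alternative == "greater":
        return 1.0 - z.cdf(t)
    if alternative == "less":
        return z.cdf(t)
    if alternative == "two-sided":
        return 2.0 * (1.0 - z.cdf(abs(t)))
    raise ValueError(f"unknown alternative: {alternative!r}")

def reject(p, alpha):
    """Step (iii): reject H0 at the 100*alpha% level only if p-value < alpha."""
    return p < alpha

p = p_value(1.52, "greater")
print(f"p-value = {p:.3f}; reject at 5%: {reject(p, 0.05)}; reject at 10%: {reject(p, 0.10)}")
```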


C-6e The Relationship between Confidence Intervals 
and Hypothesis Testing 


Because constructing confidence intervals and hypothesis tests both involve probability statements, it 
is natural to think that they are somehow linked. It turns out that they are. After a confidence interval 
has been constructed, we can carry out a variety of hypothesis tests. 

The confidence intervals we have discussed are all two-sided by nature. (In this text, we will
have no need to construct one-sided confidence intervals.) Thus, confidence intervals can be used to
test against two-sided alternatives. In the case of a population mean, the null is given by (C.31), and
the alternative is (C.34). Suppose we have constructed a 95% confidence interval for μ. Then, if the
hypothesized value of μ under H₀, μ₀, is not in the confidence interval, then H₀: μ = μ₀ is rejected
against H₁: μ ≠ μ₀ at the 5% level. If μ₀ lies in this interval, then we fail to reject H₀ at the 5% level.
Notice how any value for μ₀ can be tested once a confidence interval is constructed, and because a
confidence interval contains more than one value, there are many null hypotheses that will not be rejected.


EXAMPLE C.8 Training Grants and Worker Productivity 


In the Holzer et al. example, we constructed a 95% confidence interval for the mean change in scrap
rate μ as [−2.28, −.02]. Because zero is excluded from this interval, we reject H₀: μ = 0 against
H₁: μ ≠ 0 at the 5% level. This 95% confidence interval also means that we fail to reject H₀: μ = −2
at the 5% level. In fact, there is a continuum of null hypotheses that are not rejected given this
confidence interval.
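
The link between the interval and the two-sided test can be illustrated with a few lines of code. The sketch below rebuilds the 95% interval from the sample average (−1.15) and standard error (.54) reported for the job training example, together with 2.093 as the 97.5th percentile of the t distribution with 19 degrees of freedom (the text quotes it as about 2.09; the extra digit is an assumption taken from standard tables). The helper name `rejects` is illustrative.

```python
ybar, se = -1.15, 0.54   # sample average and standard error from the example
c = 2.093                # assumed 97.5th percentile of t with 19 df (about 2.09)

lower, upper = ybar - c * se, ybar + c * se

def rejects(mu0):
    """Two-sided 5% test of H0: mu = mu0, read directly off the 95% interval."""
    return not (lower <= mu0 <= upper)

print(f"95% CI: [{lower:.2f}, {upper:.2f}]")
for mu0 in (0.0, -2.0, -1.0):
    print(f"H0: mu = {mu0}: reject = {rejects(mu0)}")
```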


C-6f Practical versus Statistical Significance 


In the examples covered so far, we have produced three kinds of evidence concerning population 
parameters: point estimates, confidence intervals, and hypothesis tests. These tools for learning about 
population parameters are equally important. There is an understandable tendency for students to 
focus on confidence intervals and hypothesis tests because these are things to which we can attach 
confidence or significance levels. But in any study, we must also interpret the magnitudes of point 
estimates. 

The sign and magnitude of ȳ determine its practical significance and allow us to discuss the
direction of an intervention or policy effect, and whether the estimated effect is “large” or “small.”
On the other hand, statistical significance of ȳ depends on the magnitude of its t statistic. For testing
H₀: μ = 0, the t statistic is simply t = ȳ/se(ȳ). In other words, statistical significance depends on the
ratio of ȳ to its standard error. Consequently, a t statistic can be large because ȳ is large or se(ȳ) is
small. In applications, it is important to discuss both practical and statistical significance, being aware
that an estimate can be statistically significant without being especially large in a practical sense.
Whether an estimate is practically important depends on the context as well as on one’s judgment, so
there are no set rules for determining practical significance.


EXAMPLE C.9 Effect of Freeway Width on Commute Time


Let Y denote the change in commute time, measured in minutes, for commuters in a metropolitan area
from before a freeway was widened to after the freeway was widened. Assume that Y ~ Normal(μ, σ²).
The null hypothesis that the widening did not reduce average commute time is H₀: μ = 0; the alternative
that it reduced average commute time is H₁: μ < 0. Suppose a random sample of commuters of
size n = 900 is obtained to determine the effectiveness of the freeway project. The average change
in commute time is computed to be ȳ = −3.6, and the sample standard deviation is s = 32.7; thus,
se(ȳ) = 32.7/√900 = 1.09. The t statistic is t = −3.6/1.09 = −3.30, which is very statistically
significant; the p-value is about .0005. Thus, we conclude that the freeway widening had a statistically
significant effect on average commute time.
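
The numbers in this example are easy to verify. The sketch below uses the reported sample average (−3.6), standard deviation (32.7), and sample size (900), and takes the one-sided p-value from the standard normal distribution, which is a reasonable approximation with 899 degrees of freedom. It prints the point estimate next to the test result, since the two answer different questions.

```python
from math import sqrt
from statistics import NormalDist

ybar, s, n = -3.6, 32.7, 900     # values reported in the example

se = s / sqrt(n)                 # standard error of the sample average
t = ybar / se                    # t statistic for H0: mu = 0
p_value = NormalDist().cdf(t)    # area to the left of t (alternative mu < 0)

print(f"point estimate: {ybar} minutes")
print(f"se = {se:.2f}, t = {t:.2f}, p-value = {p_value:.4f}")
```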

If the outcome of the hypothesis test is all that were reported from the study, it would be
misleading. Reporting only statistical significance masks the fact that the estimated reduction in average
commute time, 3.6 minutes, seems pretty meager, although this depends to some extent on what the
average commute time was prior to widening the freeway. To be up front, we should report the point
estimate of −3.6, along with the significance test.

Finding point estimates that are statistically significant without being practically significant can
occur when we are working with large samples. To discuss why this happens, it is useful to have the
following definition.


Test Consistency. A consistent test rejects H₀ with probability approaching one as the sample
size grows whenever H₁ is true.

Another way to say that a test is consistent is that, as the sample size tends to infinity, the power
of the test gets closer and closer to unity whenever H₁ is true. All of the tests we cover in this text
have this property. In the case of testing hypotheses about a population mean, test consistency follows
because the variance of Ȳ converges to zero as the sample size gets large. The t statistic for testing
H₀: μ = 0 is T = Ȳ/(S/√n). Because plim(Ȳ) = μ and plim(S) = σ, it follows that if, say, μ > 0,
then T gets larger and larger (with high probability) as n → ∞. In other words, no matter how close μ
is to zero, we can be almost certain to reject H₀: μ = 0 given a large enough sample size. This says
nothing about whether μ is large in a practical sense.
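
Test consistency can be illustrated numerically. The sketch below approximates the power of the one-sided 5% test of H₀: μ = 0 against μ > 0 by treating T as normally distributed with mean μ√n/σ and unit variance, a large-sample simplification that ignores the estimation of σ. The values μ = .01 and σ = 1 are hypothetical, chosen so that the true mean is very close to zero.

```python
from math import sqrt
from statistics import NormalDist

def approx_power(mu, sigma, n, alpha=0.05):
    """Approximate P(reject H0: mu = 0) for the one-sided test against mu > 0.

    Uses the large-sample approximation T ~ Normal(mu * sqrt(n) / sigma, 1).
    """
    z = NormalDist()
    c = z.inv_cdf(1 - alpha)                 # standard normal critical value
    return 1.0 - z.cdf(c - mu * sqrt(n) / sigma)

# Hypothetical population: a tiny positive mean relative to the spread.
for n in (100, 10_000, 1_000_000):
    print(f"n = {n:>9,}: power = {approx_power(mu=0.01, sigma=1.0, n=n):.3f}")
```

The power climbs toward one as n grows even though μ stays fixed at a practically negligible value, which is exactly why statistical significance in very large samples says little about practical importance.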


C-7 Remarks on Notation 


In our review of probability and statistics here and in Math Refresher B, we have been careful to use 
standard conventions to denote random variables, estimators, and test statistics. For example, we have 
used W to indicate an estimator (random variable) and w to denote a particular estimate (outcome 
of the random variable W). Distinguishing between an estimator and an estimate is important for 
understanding various concepts in estimation and hypothesis testing. However, making this distinc- 
tion quickly becomes a burden in econometric analysis because the models are more complicated: 
many random variables and parameters will be involved, and being true to the usual conventions from 
probability and statistics requires many extra symbols. 

In the main text, we use a simpler convention that is widely used in econometrics. If θ is a population
parameter, the notation θ̂ (“theta hat”) will be used to denote both an estimator and an estimate
of θ. This notation is useful in that it provides a simple way of attaching an estimator to the population
parameter it is supposed to be estimating. Thus, if the population parameter is β, then β̂ denotes
an estimator or estimate of β; if the parameter is σ², σ̂² is an estimator or estimate of σ²; and so on.
Sometimes, we will discuss two estimators of the same parameter, in which case we will need a different
notation, such as θ̃ (“theta tilde”).

Although dropping the conventions from probability and statistics to indicate estimators, random
variables, and test statistics puts additional responsibility on you, it is not a big deal once the difference
between an estimator and an estimate is understood. If we are discussing statistical properties
of θ̂ (such as deriving whether or not it is unbiased or consistent), then we are necessarily viewing θ̂
as an estimator. On the other hand, if we write something like θ̂ = 1.73, then we are clearly denoting
a point estimate from a given sample of data. The confusion that can arise by using θ̂ to denote both
should be minimal once you have a good understanding of probability and statistics.


Summary 


We have discussed topics from mathematical statistics that are heavily relied upon in econometric 
analysis. The notion of an estimator, which is simply a rule for combining data to estimate a popula- 
tion parameter, is fundamental. We have covered various properties of estimators. The most important 
small sample properties are unbiasedness and efficiency, the latter of which depends on comparing 
variances when estimators are unbiased. Large sample properties concern the sequence of estimators
obtained as the sample size grows, and they are also depended upon in econometrics. Any useful
estimator is consistent. The central limit theorem implies that, in large samples, the sampling distribution
of most estimators is approximately normal.

The sampling distribution of an estimator can be used to construct confidence intervals. We saw this
for estimating the mean from a normal distribution and for computing approximate confidence intervals 
in nonnormal cases. Classical hypothesis testing, which requires specifying a null hypothesis, an alter- 
native hypothesis, and a significance level, is carried out by comparing a test statistic to a critical value. 
Alternatively, a p-value can be computed that allows us to carry out a test at any significance level. 


Key Terms 


Alternative Hypothesis
Asymptotic Normality
Bias
Biased Estimator
Central Limit Theorem (CLT)
Confidence Interval
Consistent Estimator
Consistent Test
Critical Value
Estimate
Estimator
Hypothesis Test
Inconsistent
Interval Estimator
Law of Large Numbers (LLN)
Least Squares Estimator
Log-Likelihood Function
Maximum Likelihood Estimator
Mean Squared Error (MSE)
Method of Moments
Minimum Variance Unbiased Estimator
Null Hypothesis
One-Sided Alternative
One-Tailed Test
Population
Power of a Test
Practical Significance
Probability Limit
p-Value
Random Sample
Rejection Region
Sample Average
Sample Correlation Coefficient
Sample Covariance
Sample Standard Deviation
Sample Variance
Sampling Distribution
Sampling Standard Deviation
Sampling Variance
Significance Level
Standard Error
Statistical Significance
t Statistic
Test Statistic
Two-Sided Alternative
Two-Tailed Test
Type I Error
Type II Error
Unbiased Estimator


Problems


1 Let Y₁, Y₂, Y₃, and Y₄ be independent, identically distributed random variables from a population with
mean μ and variance σ². Let Ȳ = (1/4)(Y₁ + Y₂ + Y₃ + Y₄) denote the average of these four random
variables.

(i) What are the expected value and variance of Ȳ in terms of μ and σ²?
(ii) Now, consider a different estimator of μ:


$$W = \frac{1}{8}Y_1 + \frac{1}{8}Y_2 + \frac{1}{4}Y_3 + \frac{1}{2}Y_4.$$


This is an example of a weighted average of the Yᵢ. Show that W is also an unbiased estimator of
μ. Find the variance of W.

(iii) Based on your answers to parts (i) and (ii), which estimator of μ do you prefer, Ȳ or W?


2 This is a more general version of Problem C.1. Let $Y_1, Y_2, \ldots, Y_n$ be $n$ pairwise uncorrelated random variables with common mean $\mu$ and common variance $\sigma^2$. Let $\bar{Y}$ denote the sample average.
(i) Define the class of linear estimators of $\mu$ by

$$W_a = a_1 Y_1 + a_2 Y_2 + \cdots + a_n Y_n,$$

where the $a_i$ are constants. What restriction on the $a_i$ is needed for $W_a$ to be an unbiased estimator of $\mu$?
(ii) Find $\mathrm{Var}(W_a)$.




(iii) For any numbers $a_1, a_2, \ldots, a_n$, the following inequality holds:
$(a_1 + a_2 + \cdots + a_n)^2/n \le a_1^2 + a_2^2 + \cdots + a_n^2$. Use this, along with parts (i) and (ii), to show
that $\mathrm{Var}(W_a) \ge \mathrm{Var}(\bar{Y})$ whenever $W_a$ is unbiased, so that $\bar{Y}$ is the best linear unbiased estimator.
[Hint: What does the inequality become when the $a_i$ satisfy the restriction from part (i)?]


3 Let $\bar{Y}$ denote the sample average from a random sample with mean $\mu$ and variance $\sigma^2$. Consider two alternative estimators of $\mu$: $W_1 = [(n - 1)/n]\bar{Y}$ and $W_2 = \bar{Y}/2$.

(i) Show that $W_1$ and $W_2$ are both biased estimators of $\mu$ and find the biases. What happens to the biases as $n \to \infty$? Comment on any important differences in bias for the two estimators as the sample size gets large.

(ii) Find the probability limits of $W_1$ and $W_2$. {Hint: Use Properties PLIM.1 and PLIM.2; for $W_1$, note that $\mathrm{plim}[(n - 1)/n] = 1$.} Which estimator is consistent?

(iii) Find $\mathrm{Var}(W_1)$ and $\mathrm{Var}(W_2)$.

(iv) Argue that $W_1$ is a better estimator than $\bar{Y}$ if $\mu$ is “close” to zero. (Consider both bias and variance.)


4 For positive random variables $X$ and $Y$, suppose the expected value of $Y$ given $X$ is $E(Y|X) = \theta X$. The unknown parameter $\theta$ shows how the expected value of $Y$ changes with $X$.

(i) Define the random variable $Z = Y/X$. Show that $E(Z) = \theta$. [Hint: Use Property CE.2 in Math Refresher B along with the law of iterated expectations, Property CE.4 (also in Math Refresher B). In particular, first show that $E(Z|X) = \theta$ and then use CE.4.]

(ii) Use part (i) to prove that the estimator $W_1 = n^{-1}\sum_{i=1}^{n}(Y_i/X_i)$ is unbiased for $\theta$, where $\{(X_i, Y_i): i = 1, 2, \ldots, n\}$ is a random sample.

(iii) Explain why the estimator $W_2 = \bar{Y}/\bar{X}$, where the overbars denote sample averages, is not the same as $W_1$. Nevertheless, show that $W_2$ is also unbiased for $\theta$.

(iv) The following table contains data on corn yields for several counties in Iowa. The USDA predicts the number of hectares of corn in each county based on satellite photos. Researchers count the number of “pixels” of corn in the satellite picture (as opposed to, for example, the number of pixels of soybeans or of uncultivated land) and use these to predict the actual number of hectares. To develop a prediction equation to be used for counties in general, the USDA surveyed farmers in selected counties to obtain corn yields in hectares. Let $Y_i$ = corn yield in county $i$ and let $X_i$ = number of corn pixels in the satellite picture for county $i$. There are $n = 17$ observations for eight counties. Use this sample to compute the estimates of $\theta$ devised in parts (ii) and (iii). Are the estimates similar?


Plot    Corn Yield    Corn Pixels
1       165.76        374
2        96.32        209
3        76.08        253
4       185.35        432
5       116.43        367
6       162.08        361
7       152.04        288
8       161.75        369
9        92.88        206
10      149.94        316
11       64.75        145
12      127.07        355
13      133.55        295
14       77.70        223
15      206.39        459
16      108.33        290
17      118.17        307


5 Let $Y$ denote a Bernoulli($\theta$) random variable with $0 < \theta < 1$. Suppose we are interested in estimating the odds ratio, $\gamma = \theta/(1 - \theta)$, which is the probability of success over the probability of failure. Given a random sample $\{Y_1, \ldots, Y_n\}$, we know that an unbiased and consistent estimator of $\theta$ is $\bar{Y}$, the proportion of successes in $n$ trials. A natural estimator of $\gamma$ is $G = \bar{Y}/(1 - \bar{Y})$, the proportion of successes over the proportion of failures in the sample.

(i) Why is $G$ not an unbiased estimator of $\gamma$?

(ii) Use PLIM.2 (iii) to show that $G$ is a consistent estimator of $\gamma$.


6 You are hired by the governor to study whether a tax on liquor has decreased average liquor consumption in your state. You are able to obtain, for a sample of individuals selected at random, the difference in liquor consumption (in ounces) for the years before and after the tax. For person $i$ who is sampled randomly from the population, $Y_i$ denotes the change in liquor consumption. Treat these as a random sample from a Normal($\mu$, $\sigma^2$) distribution.
(i) The null hypothesis is that there was no change in average liquor consumption. State this formally in terms of $\mu$.
(ii) The alternative is that there was a decline in liquor consumption; state the alternative in terms of $\mu$.
(iii) Now, suppose your sample size is $n = 900$ and you obtain the estimates $\bar{y} = -32.8$ and $s = 466.4$. Calculate the $t$ statistic for testing $H_0$ against $H_1$; obtain the p-value for the test. (Because of the large sample size, just use the standard normal distribution tabulated in Table G.1.) Do you reject $H_0$ at the 5% level? At the 1% level?
(iv) Would you say that the estimated fall in consumption is large in magnitude? Comment on the practical versus statistical significance of this estimate.
(v) What has been implicitly assumed in your analysis about other determinants of liquor consumption over the two-year period in order to infer causality from the tax change to liquor consumption?


7 The new management at a bakery claims that workers are now more productive than they were under old management, which is why wages have “generally increased.” Let $W_i^b$ be Worker $i$’s wage under the old management and let $W_i^a$ be Worker $i$’s wage after the change. The difference is $D_i = W_i^a - W_i^b$. Assume that the $D_i$ are a random sample from a Normal($\mu$, $\sigma^2$) distribution.

(i) Using the following data on 15 workers, construct an exact 95% confidence interval for $\mu$.

(ii) Formally state the null hypothesis that there has been no change in average wages. In particular, what is $E(D_i)$ under $H_0$? If you are hired to examine the validity of the new management’s claim, what is the relevant alternative hypothesis in terms of $\mu = E(D_i)$?

(iii) Test the null hypothesis from part (ii) against the stated alternative at the 5% and 1% levels.

(iv) Obtain the p-value for the test in part (iii).


Worker    Wage Before    Wage After
1          8.30           9.25
2          9.40           9.00
3          9.00           9.25
4         10.50          10.00
5         11.40          12.00
6          8.75           9.50
7         10.00          10.25
8          9.50           9.50
9         10.80          11.50
10        12.55          13.10
11        12.00          11.50
12         8.65           9.00
13         7.15           7.15
14        11.25          11.50
15        12.65          13.00


8 The New York Times (2/5/90) reported three-point shooting performance for the top 10 three-point shooters in the NBA. The following table summarizes these data:


Player FGA-FGM 
Mark Price 429-188 
Trent Tucker 833-345 
Dale Ellis 1,149-472 
Craig Hodges 1,016-396 
Danny Ainge 1,051-406 
Byron Scott 676-260 
Reggie Miller 416-159 
Larry Bird 1,206-455 
Jon Sundvold 440-166 
Brian Taylor 417-157 


Note: FGA = field goals attempted and FGM = field goals made. 


For a given player, the outcome of a particular shot can be modeled as a Bernoulli (zero-one) variable: if $Y_i$ is the outcome of shot $i$, then $Y_i = 1$ if the shot is made, and $Y_i = 0$ if the shot is missed. Let $\theta$ denote the probability of making any particular three-point shot attempt. The natural estimator of $\theta$ is $\bar{Y} = \mathrm{FGM}/\mathrm{FGA}$.

(i) Estimate $\theta$ for Mark Price.

(ii) Find the standard deviation of the estimator $\bar{Y}$ in terms of $\theta$ and the number of shot attempts, $n$.

(iii) The asymptotic distribution of $(\bar{Y} - \theta)/\mathrm{se}(\bar{Y})$ is standard normal, where $\mathrm{se}(\bar{Y}) = \sqrt{\bar{Y}(1 - \bar{Y})/n}$. Use this fact to test $H_0$: $\theta = .5$ against $H_1$: $\theta < .5$ for Mark Price. Use a 1% significance level.


9 Suppose that a military dictator in an unnamed country holds a plebiscite (a yes/no vote of confidence) and claims that he was supported by 65% of the voters. A human rights group suspects foul play and hires you to test the validity of the dictator’s claim. You have a budget that allows you to randomly sample 200 voters from the country.

(i) Let X be the number of yes votes obtained from a random sample of 200 out of the entire 
voting population. What is the expected value of X if, in fact, 65% of all voters supported the 
dictator? 

(ii) What is the standard deviation of X, again assuming that the true fraction voting yes in the 
plebiscite is .65? 

(iii) Now, you collect your sample of 200, and you find that 115 people actually voted yes. Use the
CLT to approximate the probability that you would find 115 or fewer yes votes from a random 
sample of 200 if, in fact, 65% of the entire population voted yes. 

(iv) How would you explain the relevance of the number in part (iii) to someone who does not have 
training in statistics? 


10 Before a strike prematurely ended the 1994 major league baseball season, Tony Gwynn of the San Diego Padres had 165 hits in 419 at bats, for a .394 batting average. There was discussion about whether Gwynn was a potential .400 hitter that year. This issue can be couched in terms of Gwynn’s probability of getting a hit on a particular at bat, call it $\theta$. Let $Y_i$ be the Bernoulli($\theta$) indicator equal to unity if Gwynn gets a hit during his $i$th at bat, and zero otherwise. Then, $Y_1, Y_2, \ldots, Y_n$ is a random sample from a Bernoulli($\theta$) distribution, where $\theta$ is the probability of success, and $n = 419$.
Our best point estimate of $\theta$ is Gwynn’s batting average, which is just the proportion of successes: $\bar{y} = .394$. Using the fact that $\mathrm{se}(\bar{y}) = \sqrt{\bar{y}(1 - \bar{y})/n}$, construct an approximate 95% confidence interval for $\theta$, using the standard normal distribution. Would you say there is strong evidence against Gwynn’s being a potential .400 hitter? Explain.


11 Suppose that between their first and second years in college, 400 students are randomly selected and given a university grant to purchase a new computer. For student $i$, $y_i$ denotes the change in GPA from the first year to the second year. If the average change is $\bar{y} = .132$ with standard deviation $s = 1.27$, is the average change in GPAs statistically greater than zero?


12 (Requires Calculus) A count random variable, say $Y$, takes on nonnegative integer values, $\{0, 1, 2, \ldots\}$. The most common distribution for a count variable is the Poisson($\theta$) distribution, where the parameter $\theta$ is the expected value: $\theta = E(Y)$. The probability density function is

$$f(y; \theta) = \begin{cases} \exp(-\theta)\,\theta^{y}/y!, & y = 0, 1, 2, \ldots \\ 0, & \text{otherwise.} \end{cases}$$

It can be shown that $\mathrm{Var}(Y) = \theta$, so that the mean and variance are the same.

(i) For a random draw $Y_i$ from the population, find the log-likelihood function $\ell(\theta; Y_i) = \log[f(Y_i; \theta)]$. What is the log likelihood for a random sample of size $n$, say $\mathcal{L}_n(\theta)$? [Hint: Look at equation (C.16).]

(ii) Using the notational convention in Section C.7, find the first order condition for the MLE, $\hat{\theta}$, and show that $\hat{\theta} = \bar{Y}$, the sample average.

(iii) Why is $\hat{\theta}$ unbiased?

(iv) Find $\mathrm{Var}(\bar{Y})$ as a function of $\theta$ and $n$.

(v) Why is $\bar{Y}$ consistent?

(vi) Do the unbiasedness and consistency of the MLE in this case depend on whether the Poisson distribution is correct? Explain.

(vii) What is the distribution of

$$\frac{\sqrt{n}\,(\bar{Y} - \theta)}{\sqrt{\theta}}$$

as $n \to \infty$? Explain.
(viii) If $E(Y) = \theta$ but $\mathrm{Var}(Y) = v(\theta) > 0$, so that the Poisson distribution may fail, modify the random variable in (vii) so that it has a limiting distribution that does not depend on $\theta$.


Summary of Matrix Algebra


This Advanced Treatment summarizes the matrix algebra concepts, including the algebra of probability, needed for the study of multiple linear regression models using matrices in Advanced Treatment E. None of this material is used in the main text.


D-1 Basic Definitions 


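Definition D.1 (Matrix). A matrix is a rectangular array of numbers. More precisely, an $m \times n$ matrix has $m$ rows and $n$ columns. The positive integer $m$ is called the row dimension, and $n$ is called the column dimension.


We use uppercase boldface letters to denote matrices. We can write an $m \times n$ matrix generically as

$$\mathbf{A} = [a_{ij}] = \begin{bmatrix} a_{11} & a_{12} & a_{13} & \cdots & a_{1n} \\ a_{21} & a_{22} & a_{23} & \cdots & a_{2n} \\ \vdots & \vdots & \vdots & & \vdots \\ a_{m1} & a_{m2} & a_{m3} & \cdots & a_{mn} \end{bmatrix},$$

where $a_{ij}$ represents the element in the $i$th row and the $j$th column. For example, $a_{25}$ stands for the number in the second row and the fifth column of $\mathbf{A}$. A specific example of a $2 \times 3$ matrix is

$$\mathbf{A} = \begin{bmatrix} 2 & -1 & 7 \\ -4 & 5 & 0 \end{bmatrix}, \qquad \text{[D.1]}$$

where $a_{13} = 7$. The shorthand $\mathbf{A} = [a_{ij}]$ is often used to define matrix operations.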


Definition D.2 (Square Matrix). A square matrix has the same number of rows and col- 
umns. The dimension of a square matrix is its number of rows and columns. 


Definition D.3 (Vectors)


(i) A $1 \times m$ matrix is called a row vector (of dimension $m$) and can be written as $\mathbf{x} = (x_1, x_2, \ldots, x_m)$.
(ii) An $n \times 1$ matrix is called a column vector and can be written as

$$\mathbf{y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}.$$


Definition D.4 (Diagonal Matrix). A square matrix $\mathbf{A}$ is a diagonal matrix when all of its off-diagonal elements are zero, that is, $a_{ij} = 0$ for all $i \neq j$. We can always write a diagonal matrix as

$$\mathbf{A} = \begin{bmatrix} a_{11} & 0 & 0 & \cdots & 0 \\ 0 & a_{22} & 0 & \cdots & 0 \\ \vdots & \vdots & \vdots & & \vdots \\ 0 & 0 & 0 & \cdots & a_{nn} \end{bmatrix}.$$


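Definition D.5 (Identity and Zero Matrices)
(i) The $n \times n$ identity matrix, denoted $\mathbf{I}$, or sometimes $\mathbf{I}_n$ to emphasize its dimension, is the diagonal matrix with unity (one) in each diagonal position, and zero elsewhere:

$$\mathbf{I} = \mathbf{I}_n = \begin{bmatrix} 1 & 0 & 0 & \cdots & 0 \\ 0 & 1 & 0 & \cdots & 0 \\ \vdots & \vdots & \vdots & & \vdots \\ 0 & 0 & 0 & \cdots & 1 \end{bmatrix}.$$

(ii) The $m \times n$ zero matrix, denoted $\mathbf{0}$, is the $m \times n$ matrix with zero for all entries. This need not be a square matrix.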


D-2 Matrix Operations 
D-2a Matrix Addition 


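Two matrices $\mathbf{A}$ and $\mathbf{B}$, each having dimension $m \times n$, can be added element by element: $\mathbf{A} + \mathbf{B} = [a_{ij} + b_{ij}]$. More precisely,

$$\mathbf{A} + \mathbf{B} = \begin{bmatrix} a_{11} + b_{11} & a_{12} + b_{12} & \cdots & a_{1n} + b_{1n} \\ a_{21} + b_{21} & a_{22} + b_{22} & \cdots & a_{2n} + b_{2n} \\ \vdots & \vdots & & \vdots \\ a_{m1} + b_{m1} & a_{m2} + b_{m2} & \cdots & a_{mn} + b_{mn} \end{bmatrix}.$$

For example,

$$\begin{bmatrix} 2 & -1 & 7 \\ -4 & 5 & 0 \end{bmatrix} + \begin{bmatrix} 1 & 0 & -4 \\ 4 & 2 & 3 \end{bmatrix} = \begin{bmatrix} 3 & -1 & 3 \\ 0 & 7 & 3 \end{bmatrix}.$$

Matrices of different dimensions cannot be added.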


D-2b Scalar Multiplication 


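Given any real number $\gamma$ (often called a scalar), scalar multiplication is defined as $\gamma\mathbf{A} = [\gamma a_{ij}]$, or

$$\gamma\mathbf{A} = \begin{bmatrix} \gamma a_{11} & \gamma a_{12} & \cdots & \gamma a_{1n} \\ \gamma a_{21} & \gamma a_{22} & \cdots & \gamma a_{2n} \\ \vdots & \vdots & & \vdots \\ \gamma a_{m1} & \gamma a_{m2} & \cdots & \gamma a_{mn} \end{bmatrix}.$$

For example, if $\gamma = 2$ and $\mathbf{A}$ is the matrix in equation (D.1), then

$$\gamma\mathbf{A} = \begin{bmatrix} 4 & -2 & 14 \\ -8 & 10 & 0 \end{bmatrix}.$$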


D-2c Matrix Multiplication 


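To multiply matrix $\mathbf{A}$ by matrix $\mathbf{B}$ to form the product $\mathbf{AB}$, the column dimension of $\mathbf{A}$ must equal the row dimension of $\mathbf{B}$. Therefore, let $\mathbf{A}$ be an $m \times n$ matrix and let $\mathbf{B}$ be an $n \times p$ matrix. Then, matrix multiplication is defined as

$$\mathbf{AB} = \left[\sum_{k=1}^{n} a_{ik} b_{kj}\right].$$

In other words, the $(i, j)$th element of the new matrix $\mathbf{AB}$ is obtained by multiplying each element in the $i$th row of $\mathbf{A}$ by the corresponding element in the $j$th column of $\mathbf{B}$ and adding these $n$ products together. A schematic may help make this process more transparent:

$$\underbrace{\begin{bmatrix} a_{i1} & a_{i2} & \cdots & a_{in} \end{bmatrix}}_{i\text{th row of } \mathbf{A}} \; \underbrace{\begin{bmatrix} b_{1j} \\ b_{2j} \\ \vdots \\ b_{nj} \end{bmatrix}}_{j\text{th column of } \mathbf{B}} = \underbrace{\sum_{k=1}^{n} a_{ik} b_{kj}}_{(i,j)\text{th element of } \mathbf{AB}},$$

where, by the definition of the summation operator in Math Refresher A,

$$\sum_{k=1}^{n} a_{ik} b_{kj} = a_{i1} b_{1j} + a_{i2} b_{2j} + \cdots + a_{in} b_{nj}.$$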


For example, 


We can also multiply a matrix and a vector. If $\mathbf{A}$ is an $n \times m$ matrix and $\mathbf{y}$ is an $m \times 1$ vector, then $\mathbf{Ay}$ is an $n \times 1$ vector. If $\mathbf{x}$ is a $1 \times n$ vector, then $\mathbf{xA}$ is a $1 \times m$ vector.

Matrix addition, scalar multiplication, and matrix multiplication can be combined in various ways, and these operations satisfy several rules that are familiar from basic operations on numbers. In the following list of properties, $\mathbf{A}$, $\mathbf{B}$, and $\mathbf{C}$ are matrices with appropriate dimensions for applying each operation, and $\alpha$ and $\beta$ are real numbers. Most of these properties are easy to illustrate from the definitions.


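Properties of Matrix Operations. (1) $(\alpha + \beta)\mathbf{A} = \alpha\mathbf{A} + \beta\mathbf{A}$; (2) $\alpha(\mathbf{A} + \mathbf{B}) = \alpha\mathbf{A} + \alpha\mathbf{B}$; (3) $(\alpha\beta)\mathbf{A} = \alpha(\beta\mathbf{A})$; (4) $\alpha(\mathbf{AB}) = (\alpha\mathbf{A})\mathbf{B}$; (5) $\mathbf{A} + \mathbf{B} = \mathbf{B} + \mathbf{A}$; (6) $(\mathbf{A} + \mathbf{B}) + \mathbf{C} = \mathbf{A} + (\mathbf{B} + \mathbf{C})$; (7) $(\mathbf{AB})\mathbf{C} = \mathbf{A}(\mathbf{BC})$; (8) $\mathbf{A}(\mathbf{B} + \mathbf{C}) = \mathbf{AB} + \mathbf{AC}$; (9) $(\mathbf{A} + \mathbf{B})\mathbf{C} = \mathbf{AC} + \mathbf{BC}$; (10) $\mathbf{IA} = \mathbf{AI} = \mathbf{A}$; (11) $\mathbf{A} + \mathbf{0} = \mathbf{0} + \mathbf{A} = \mathbf{A}$; (12) $\mathbf{A} - \mathbf{A} = \mathbf{0}$; (13) $\mathbf{A0} = \mathbf{0A} = \mathbf{0}$; and (14) $\mathbf{AB} \neq \mathbf{BA}$, even when both products are defined.
The last property deserves further comment. If $\mathbf{A}$ is $n \times m$ and $\mathbf{B}$ is $m \times p$, then $\mathbf{AB}$ is defined, but $\mathbf{BA}$ is defined only if $n = p$ (the row dimension of $\mathbf{A}$ equals the column dimension of $\mathbf{B}$). If $\mathbf{A}$ is $m \times n$ and $\mathbf{B}$ is $n \times m$, then $\mathbf{AB}$ and $\mathbf{BA}$ are both defined, but they are not usually the same; in fact, they have different dimensions, unless $\mathbf{A}$ and $\mathbf{B}$ are both square matrices. Even when $\mathbf{A}$ and $\mathbf{B}$ are both square, $\mathbf{AB} \neq \mathbf{BA}$, except under special circumstances.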
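
The discussion of property (14) can be checked numerically. The following is a minimal Python sketch, not part of the original text, that implements the definition of matrix multiplication with plain lists of rows. The helper name `matmul` and the square matrices `A` and `B` are illustrative choices; `C` is the matrix from equation (D.1), and `D` is an arbitrary 3 x 2 matrix.

```python
def matmul(A, B):
    """Multiply an m x n matrix A by an n x p matrix B (lists of rows)."""
    n = len(B)
    if any(len(row) != n for row in A):
        raise ValueError("column dimension of A must equal row dimension of B")
    p = len(B[0])
    # (i, j)th element: sum over k of a_ik * b_kj
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(p)]
            for i in range(len(A))]


# Two square matrices whose products differ
A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]
AB = matmul(A, B)
BA = matmul(B, A)
print(AB, BA, AB == BA)

# A 2 x 3 and a 3 x 2 matrix: both products exist but have different dimensions
C = [[2, -1, 7], [-4, 5, 0]]
D = [[1, 0], [0, 1], [1, 1]]
CD = matmul(C, D)   # 2 x 2
DC = matmul(D, C)   # 3 x 3
print(len(CD), len(CD[0]), len(DC), len(DC[0]))
```

Attempting `matmul(C, C)` raises a `ValueError`, which mirrors the requirement that the column dimension of the first matrix equal the row dimension of the second.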


D-2d Transpose 


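Definition D.6 (Transpose). Let $\mathbf{A} = [a_{ij}]$ be an $m \times n$ matrix. The transpose of $\mathbf{A}$, denoted $\mathbf{A}'$ (called A prime), is the $n \times m$ matrix obtained by interchanging the rows and columns of $\mathbf{A}$. We can write this as $\mathbf{A}' = [a_{ji}]$.


For example,

$$\begin{bmatrix} 2 & -1 & 7 \\ -4 & 5 & 0 \end{bmatrix}' = \begin{bmatrix} 2 & -4 \\ -1 & 5 \\ 7 & 0 \end{bmatrix}.$$


Properties of Transpose. (1) $(\mathbf{A}')' = \mathbf{A}$; (2) $(\alpha\mathbf{A})' = \alpha\mathbf{A}'$ for any scalar $\alpha$; (3) $(\mathbf{A} + \mathbf{B})' = \mathbf{A}' + \mathbf{B}'$; (4) $(\mathbf{AB})' = \mathbf{B}'\mathbf{A}'$, where $\mathbf{A}$ is $m \times n$ and $\mathbf{B}$ is $n \times k$; (5) $\mathbf{x}'\mathbf{x} = \sum_{i=1}^{n} x_i^2$, where $\mathbf{x}$ is an $n \times 1$ vector; and (6) If $\mathbf{A}$ is an $n \times k$ matrix with rows given by the $1 \times k$ vectors $\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_n$, so that we can write

$$\mathbf{A} = \begin{bmatrix} \mathbf{a}_1 \\ \mathbf{a}_2 \\ \vdots \\ \mathbf{a}_n \end{bmatrix},$$

then $\mathbf{A}' = (\mathbf{a}_1' \; \mathbf{a}_2' \; \cdots \; \mathbf{a}_n')$.

Definition D.7 (Symmetric Matrix). A square matrix $\mathbf{A}$ is a symmetric matrix if, and only if, $\mathbf{A}' = \mathbf{A}$.

If $\mathbf{X}$ is any $n \times k$ matrix, then $\mathbf{X}'\mathbf{X}$ is always defined and is a symmetric matrix, as can be seen by applying the first and fourth transpose properties (see Problem 3).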
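
To make the fourth transpose property and the symmetry of X'X concrete, here is a small Python sketch under stated assumptions: the helpers `transpose` and `matmul` and the matrices `A`, `B`, and `X` are illustrative and do not come from the text.

```python
def transpose(A):
    """Interchange the rows and columns of A (a list of rows)."""
    return [list(col) for col in zip(*A)]


def matmul(A, B):
    """Multiply an m x n matrix A by an n x p matrix B."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


A = [[2, -1, 7], [-4, 5, 0]]      # 2 x 3
B = [[1, 0], [0, 1], [1, 1]]      # 3 x 2

# Property (4): (AB)' = B'A'
lhs = transpose(matmul(A, B))
rhs = matmul(transpose(B), transpose(A))
print(lhs == rhs)

# X'X is defined for any n x k matrix X and is symmetric
X = [[1, 2], [1, 3], [1, 5]]      # 3 x 2
XtX = matmul(transpose(X), X)
print(XtX, XtX == transpose(XtX))
```

The symmetry check follows the argument in the text: (X'X)' = X'(X')' = X'X.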


D-2e Partitioned Matrix Multiplication 


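Let $\mathbf{A}$ be an $n \times k$ matrix with rows given by the $1 \times k$ vectors $\mathbf{a}_1, \mathbf{a}_2, \ldots, \mathbf{a}_n$, and let $\mathbf{B}$ be an $n \times m$ matrix with rows given by the $1 \times m$ vectors $\mathbf{b}_1, \mathbf{b}_2, \ldots, \mathbf{b}_n$:

$$\mathbf{A} = \begin{bmatrix} \mathbf{a}_1 \\ \mathbf{a}_2 \\ \vdots \\ \mathbf{a}_n \end{bmatrix}, \qquad \mathbf{B} = \begin{bmatrix} \mathbf{b}_1 \\ \mathbf{b}_2 \\ \vdots \\ \mathbf{b}_n \end{bmatrix}.$$

Then,

$$\mathbf{A}'\mathbf{B} = \sum_{i=1}^{n} \mathbf{a}_i'\mathbf{b}_i,$$

where for each $i$, $\mathbf{a}_i'\mathbf{b}_i$ is a $k \times m$ matrix. Therefore, $\mathbf{A}'\mathbf{B}$ can be written as the sum of $n$ matrices, each of which is $k \times m$. As a special case, we have

$$\mathbf{A}'\mathbf{A} = \sum_{i=1}^{n} \mathbf{a}_i'\mathbf{a}_i,$$

where $\mathbf{a}_i'\mathbf{a}_i$ is a $k \times k$ matrix for all $i$.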
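A more general form of partitioned matrix multiplication holds when we have matrices $\mathbf{A}$ ($m \times n$) and $\mathbf{B}$ ($n \times p$) written as

$$\mathbf{A} = \begin{bmatrix} \mathbf{A}_{11} & \mathbf{A}_{12} \\ \mathbf{A}_{21} & \mathbf{A}_{22} \end{bmatrix}, \qquad \mathbf{B} = \begin{bmatrix} \mathbf{B}_{11} & \mathbf{B}_{12} \\ \mathbf{B}_{21} & \mathbf{B}_{22} \end{bmatrix},$$

where $\mathbf{A}_{11}$ is $m_1 \times n_1$, $\mathbf{A}_{12}$ is $m_1 \times n_2$, $\mathbf{A}_{21}$ is $m_2 \times n_1$, $\mathbf{A}_{22}$ is $m_2 \times n_2$, $\mathbf{B}_{11}$ is $n_1 \times p_1$, $\mathbf{B}_{12}$ is $n_1 \times p_2$, $\mathbf{B}_{21}$ is $n_2 \times p_1$, and $\mathbf{B}_{22}$ is $n_2 \times p_2$. Naturally, $m_1 + m_2 = m$, $n_1 + n_2 = n$, and $p_1 + p_2 = p$.
When we form the product $\mathbf{AB}$, the expression looks just like when the entries are scalars:

$$\mathbf{AB} = \begin{bmatrix} \mathbf{A}_{11}\mathbf{B}_{11} + \mathbf{A}_{12}\mathbf{B}_{21} & \mathbf{A}_{11}\mathbf{B}_{12} + \mathbf{A}_{12}\mathbf{B}_{22} \\ \mathbf{A}_{21}\mathbf{B}_{11} + \mathbf{A}_{22}\mathbf{B}_{21} & \mathbf{A}_{21}\mathbf{B}_{12} + \mathbf{A}_{22}\mathbf{B}_{22} \end{bmatrix}.$$

Note that each of the matrix multiplications that form the partition on the right is well defined because the column and row dimensions are compatible for multiplication.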


D-2f Trace 


The trace of a matrix is a very simple operation defined only for square matrices. 


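Definition D.8 (Trace). For any $n \times n$ matrix $\mathbf{A}$, the trace of a matrix $\mathbf{A}$, denoted $\mathrm{tr}(\mathbf{A})$, is the sum of its diagonal elements. Mathematically,

$$\mathrm{tr}(\mathbf{A}) = \sum_{i=1}^{n} a_{ii}.$$


Properties of Trace. (1) $\mathrm{tr}(\mathbf{I}_n) = n$; (2) $\mathrm{tr}(\mathbf{A}') = \mathrm{tr}(\mathbf{A})$; (3) $\mathrm{tr}(\mathbf{A} + \mathbf{B}) = \mathrm{tr}(\mathbf{A}) + \mathrm{tr}(\mathbf{B})$; (4) $\mathrm{tr}(\alpha\mathbf{A}) = \alpha\,\mathrm{tr}(\mathbf{A})$, for any scalar $\alpha$; and (5) $\mathrm{tr}(\mathbf{AB}) = \mathrm{tr}(\mathbf{BA})$, where $\mathbf{A}$ is $m \times n$ and $\mathbf{B}$ is $n \times m$.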
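
Property (5) may look surprising because AB and BA can have different dimensions. The following Python sketch, with hypothetical helper names and an arbitrary 3 x 2 matrix `B`, checks it for the 2 x 3 matrix from equation (D.1).

```python
def matmul(A, B):
    """Multiply an m x n matrix A by an n x p matrix B (lists of rows)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def trace(A):
    """Sum of the diagonal elements of a square matrix."""
    if any(len(row) != len(A) for row in A):
        raise ValueError("trace is defined only for square matrices")
    return sum(A[i][i] for i in range(len(A)))


A = [[2, -1, 7], [-4, 5, 0]]      # 2 x 3, the matrix in (D.1)
B = [[1, 0], [0, 1], [1, 1]]      # 3 x 2, chosen arbitrarily

AB = matmul(A, B)                 # 2 x 2
BA = matmul(B, A)                 # 3 x 3
print(trace(AB), trace(BA))
```

Both traces equal the sum of all products a_ij * b_ji, which is why the order of multiplication does not matter for the trace.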


D-2g Inverse 


The notion of a matrix inverse is very important for square matrices. 


Definition D.9 (Inverse). An $n \times n$ matrix $\mathbf{A}$ has an inverse, denoted $\mathbf{A}^{-1}$, provided that $\mathbf{A}^{-1}\mathbf{A} = \mathbf{I}_n$ and $\mathbf{A}\mathbf{A}^{-1} = \mathbf{I}_n$. In this case, $\mathbf{A}$ is said to be invertible or nonsingular. Otherwise, it is said to be noninvertible or singular.


Properties of Inverse. (1) If an inverse exists, it is unique; (2) $(\alpha\mathbf{A})^{-1} = (1/\alpha)\mathbf{A}^{-1}$, if $\alpha \neq 0$ and $\mathbf{A}$ is invertible; (3) $(\mathbf{AB})^{-1} = \mathbf{B}^{-1}\mathbf{A}^{-1}$, if $\mathbf{A}$ and $\mathbf{B}$ are both $n \times n$ and invertible; and (4) $(\mathbf{A}')^{-1} = (\mathbf{A}^{-1})'$.


We will not be concerned with the mechanics of calculating the inverse of a matrix. Any matrix algebra text contains detailed examples of such calculations.


D-3 Linear Independence and Rank of a Matrix


For a set of vectors having the same dimension, it is important to know whether one vector can be 
expressed as a linear combination of the remaining vectors. 


Definition D.10 (Linear Independence). Let $\{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_r\}$ be a set of $n \times 1$ vectors. These are linearly independent vectors if, and only if,

$$\alpha_1\mathbf{x}_1 + \alpha_2\mathbf{x}_2 + \cdots + \alpha_r\mathbf{x}_r = \mathbf{0} \qquad \text{[D.2]}$$

implies that $\alpha_1 = \alpha_2 = \cdots = \alpha_r = 0$. If (D.2) holds for a set of scalars that are not all zero, then $\{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_r\}$ is linearly dependent.

The statement that $\{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_r\}$ is linearly dependent is equivalent to saying that at least one vector in this set can be written as a linear combination of the others.


Definition D.11 (Rank) 

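(i) Let $\mathbf{A}$ be an $n \times m$ matrix. The rank of a matrix $\mathbf{A}$, denoted $\mathrm{rank}(\mathbf{A})$, is the maximum number of linearly independent columns of $\mathbf{A}$.

(ii) If $\mathbf{A}$ is $n \times m$ and $\mathrm{rank}(\mathbf{A}) = m$, then $\mathbf{A}$ has full column rank.


If $\mathbf{A}$ is $n \times m$, its rank can be at most $m$. A matrix has full column rank if its columns form a linearly independent set. For example, the $3 \times 2$ matrix

$$\begin{bmatrix} 1 & 3 \\ 2 & 6 \\ 0 & 0 \end{bmatrix}$$

can have at most rank two. In fact, its rank is only one because the second column is three times the first column.


Properties of Rank. (1) $\mathrm{rank}(\mathbf{A}') = \mathrm{rank}(\mathbf{A})$; (2) If $\mathbf{A}$ is $n \times k$, then $\mathrm{rank}(\mathbf{A}) \le \min(n, k)$; and (3) If $\mathbf{A}$ is $k \times k$ and $\mathrm{rank}(\mathbf{A}) = k$, then $\mathbf{A}$ is invertible.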
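
The rank of a small matrix can be computed by row reduction. The sketch below is one possible implementation, not a method described in the text: it uses Gaussian elimination with exact rational arithmetic from Python's `fractions` module, and the function name `rank` is an illustrative choice.

```python
from fractions import Fraction


def rank(A):
    """Rank of a matrix (list of rows) via Gaussian elimination with exact fractions."""
    M = [[Fraction(x) for x in row] for row in A]
    n_rows, n_cols = len(M), len(M[0])
    r = 0  # number of pivots found so far
    for c in range(n_cols):
        # look for a nonzero entry in column c at or below row r
        pivot = next((i for i in range(r, n_rows) if M[i][c] != 0), None)
        if pivot is None:
            continue
        M[r], M[pivot] = M[pivot], M[r]
        # eliminate the entries below the pivot
        for i in range(r + 1, n_rows):
            factor = M[i][c] / M[r][c]
            M[i] = [a - factor * b for a, b in zip(M[i], M[r])]
        r += 1
        if r == n_rows:
            break
    return r


example = [[1, 3], [2, 6], [0, 0]]          # the 3 x 2 matrix from the text
print(rank(example))                        # second column is 3 times the first
print(rank([[1, 2, 0], [3, 6, 0]]))         # its transpose has the same rank
print(rank([[1, 0], [0, 1], [1, 1]]))       # full column rank
```

Exact fractions avoid the tolerance choices that a floating-point rank computation would require.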


D-4 Quadratic Forms and Positive Definite Matrices 


Definition D.12 (Quadratic Form). Let $\mathbf{A}$ be an $n \times n$ symmetric matrix. The quadratic form associated with the matrix $\mathbf{A}$ is the real-valued function defined for all $n \times 1$ vectors $\mathbf{x}$:

$$f(\mathbf{x}) = \mathbf{x}'\mathbf{A}\mathbf{x} = \sum_{i=1}^{n} a_{ii} x_i^2 + 2\sum_{i=1}^{n}\sum_{j>i} a_{ij} x_i x_j.$$


Definition D.13 (Positive Definite and Positive Semi-Definite)
(i) A symmetric matrix $\mathbf{A}$ is said to be positive definite (p.d.) if
$\mathbf{x}'\mathbf{A}\mathbf{x} > 0$ for all $n \times 1$ vectors $\mathbf{x}$ except $\mathbf{x} = \mathbf{0}$.
(ii) A symmetric matrix $\mathbf{A}$ is positive semi-definite (p.s.d.) if
$\mathbf{x}'\mathbf{A}\mathbf{x} \ge 0$ for all $n \times 1$ vectors.


If a matrix is positive definite or positive semi-definite, it is automatically assumed to be symmetric.


Properties of Positive Definite and Positive Semi-Definite Matrices. (1) A p.d. matrix has diagonal elements that are strictly positive, while a p.s.d. matrix has nonnegative diagonal elements; (2) If $\mathbf{A}$ is p.d., then $\mathbf{A}^{-1}$ exists and is p.d.; (3) If $\mathbf{X}$ is $n \times k$, then $\mathbf{X}'\mathbf{X}$ and $\mathbf{X}\mathbf{X}'$ are p.s.d.; and (4) If $\mathbf{X}$ is $n \times k$ and $\mathrm{rank}(\mathbf{X}) = k$, then $\mathbf{X}'\mathbf{X}$ is p.d. (and therefore nonsingular).


D-5 Idempotent Matrices


Definition D.14 (Idempotent Matrix). Let $\mathbf{A}$ be an $n \times n$ symmetric matrix. Then $\mathbf{A}$ is said to be an idempotent matrix if, and only if, $\mathbf{AA} = \mathbf{A}$.


For example,

$$\begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix}$$

is an idempotent matrix, as direct multiplication verifies.


Properties of Idempotent Matrices. Let A be an n X n idempotent matrix. 
(1) rank(A) = tr(A), and (2) A is positive semi-definite. 


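We can construct idempotent matrices very generally. Let $\mathbf{X}$ be an $n \times k$ matrix with $\mathrm{rank}(\mathbf{X}) = k$. Define

$$\mathbf{P} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$$
$$\mathbf{M} = \mathbf{I}_n - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' = \mathbf{I}_n - \mathbf{P}.$$

Then $\mathbf{P}$ and $\mathbf{M}$ are symmetric, idempotent matrices with $\mathrm{rank}(\mathbf{P}) = k$ and $\mathrm{rank}(\mathbf{M}) = n - k$. The ranks are most easily obtained by using Property 1: $\mathrm{tr}(\mathbf{P}) = \mathrm{tr}[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}]$ (from Property 5 for trace) $= \mathrm{tr}(\mathbf{I}_k) = k$ (by Property 1 for trace). It easily follows that $\mathrm{tr}(\mathbf{M}) = \mathrm{tr}(\mathbf{I}_n) - \mathrm{tr}(\mathbf{P}) = n - k$.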
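
The construction of P and M can be verified on a small example. The following Python sketch is only an illustration: it assumes a hypothetical 3 x 2 matrix X (a column of ones and the values 1, 2, 3), uses exact fractions, and relies on the explicit formula for a 2 x 2 inverse, so it covers only the case k = 2.

```python
from fractions import Fraction


def transpose(A):
    return [list(col) for col in zip(*A)]


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def inverse_2x2(A):
    """Inverse of a 2 x 2 matrix via the adjugate formula (k = 2 only)."""
    (a, b), (c, d) = A
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def trace(A):
    return sum(A[i][i] for i in range(len(A)))


# Hypothetical X with n = 3, k = 2 and full column rank
X = [[Fraction(1), Fraction(x)] for x in (1, 2, 3)]
Xt = transpose(X)
n = len(X)

P = matmul(matmul(X, inverse_2x2(matmul(Xt, X))), Xt)   # X (X'X)^{-1} X'
M = [[(1 if i == j else 0) - P[i][j] for j in range(n)] for i in range(n)]

print(matmul(P, P) == P, matmul(M, M) == M)   # idempotent
print(P == transpose(P), M == transpose(M))   # symmetric
print(trace(P), trace(M))                     # k and n - k
```

The product of P and M is the zero matrix in this example, as PM = P(I - P) = P - PP implies.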


D-6 Differentiation of Linear and Quadratic Forms 


For a given $n \times 1$ vector $\mathbf{a}$, consider the linear function defined by

$$f(\mathbf{x}) = \mathbf{a}'\mathbf{x},$$

for all $n \times 1$ vectors $\mathbf{x}$. The derivative of $f$ with respect to $\mathbf{x}$ is the $1 \times n$ vector of partial derivatives, which is simply

$$\partial f(\mathbf{x})/\partial\mathbf{x} = \mathbf{a}'.$$

For an $n \times n$ symmetric matrix $\mathbf{A}$, define the quadratic form

$$g(\mathbf{x}) = \mathbf{x}'\mathbf{A}\mathbf{x}.$$

Then,

$$\partial g(\mathbf{x})/\partial\mathbf{x} = 2\mathbf{x}'\mathbf{A},$$

which is a $1 \times n$ vector.


D-7 Moments and Distributions of Random Vectors


In order to derive the expected value and variance of the OLS estimators using matrices, we need to 
define the expected value and variance of a random vector. As its name suggests, a random vector 
is simply a vector of random variables. We also need to define the multivariate normal distribution. 
These concepts are simply extensions of those covered in Math Refresher B. 


D-7a Expected Value 


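Definition D.15 (Expected Value)
(i) If $\mathbf{y}$ is an $n \times 1$ random vector, the expected value of $\mathbf{y}$, denoted $E(\mathbf{y})$, is the vector of expected values: $E(\mathbf{y}) = [E(y_1), E(y_2), \ldots, E(y_n)]'$.
(ii) If $\mathbf{Z}$ is an $n \times m$ random matrix, $E(\mathbf{Z})$ is the $n \times m$ matrix of expected values: $E(\mathbf{Z}) = [E(z_{ij})]$.


Properties of Expected Value. (1) If $\mathbf{A}$ is an $m \times n$ matrix and $\mathbf{b}$ is an $m \times 1$ vector, where both are nonrandom, then $E(\mathbf{Ay} + \mathbf{b}) = \mathbf{A}E(\mathbf{y}) + \mathbf{b}$; and (2) If $\mathbf{A}$ is $p \times n$ and $\mathbf{B}$ is $m \times k$, where both are nonrandom, then $E(\mathbf{AZB}) = \mathbf{A}E(\mathbf{Z})\mathbf{B}$.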


D-7b Variance-Covariance Matrix 


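Definition D.16 (Variance-Covariance Matrix). If $\mathbf{y}$ is an $n \times 1$ random vector, its variance-covariance matrix, denoted $\mathrm{Var}(\mathbf{y})$, is defined as

$$\mathrm{Var}(\mathbf{y}) = \begin{bmatrix} \sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1n} \\ \sigma_{21} & \sigma_2^2 & \cdots & \sigma_{2n} \\ \vdots & \vdots & & \vdots \\ \sigma_{n1} & \sigma_{n2} & \cdots & \sigma_n^2 \end{bmatrix},$$

where $\sigma_j^2 = \mathrm{Var}(y_j)$ and $\sigma_{ij} = \mathrm{Cov}(y_i, y_j)$. In other words, the variance-covariance matrix has the variances of each element of $\mathbf{y}$ down its diagonal, with covariance terms in the off diagonals. Because $\mathrm{Cov}(y_i, y_j) = \mathrm{Cov}(y_j, y_i)$, it immediately follows that a variance-covariance matrix is symmetric.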


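Properties of Variance. (1) If $\mathbf{a}$ is an $n \times 1$ nonrandom vector, then $\mathrm{Var}(\mathbf{a}'\mathbf{y}) = \mathbf{a}'[\mathrm{Var}(\mathbf{y})]\mathbf{a} \ge 0$; (2) If $\mathrm{Var}(\mathbf{a}'\mathbf{y}) > 0$ for all $\mathbf{a} \neq \mathbf{0}$, $\mathrm{Var}(\mathbf{y})$ is positive definite; (3) $\mathrm{Var}(\mathbf{y}) = E[(\mathbf{y} - \boldsymbol{\mu})(\mathbf{y} - \boldsymbol{\mu})']$, where $\boldsymbol{\mu} = E(\mathbf{y})$; (4) If the elements of $\mathbf{y}$ are uncorrelated, $\mathrm{Var}(\mathbf{y})$ is a diagonal matrix. If, in addition, $\mathrm{Var}(y_j) = \sigma^2$ for $j = 1, 2, \ldots, n$, then $\mathrm{Var}(\mathbf{y}) = \sigma^2\mathbf{I}_n$; and (5) If $\mathbf{A}$ is an $m \times n$ nonrandom matrix and $\mathbf{b}$ is an $m \times 1$ nonrandom vector, then $\mathrm{Var}(\mathbf{Ay} + \mathbf{b}) = \mathbf{A}[\mathrm{Var}(\mathbf{y})]\mathbf{A}'$.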
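
Property (5) can be illustrated with a small calculation. The Python sketch below assumes a hypothetical 2 x 1 random vector y with Var(y1) = 4, Var(y2) = 9, and Cov(y1, y2) = 1, and a matrix A that forms the sum and the difference of the two elements. None of these numbers come from the text.

```python
def transpose(A):
    return [list(col) for col in zip(*A)]


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


# Hypothetical variance-covariance matrix of y = (y1, y2)'
var_y = [[4, 1],
         [1, 9]]

# A maps y into (y1 + y2, y1 - y2)'; an added constant vector b would not matter
A = [[1, 1],
     [1, -1]]

var_Ay = matmul(matmul(A, var_y), transpose(A))   # A Var(y) A'
print(var_Ay)

# Cross-check against the scalar formulas
var_sum = 4 + 9 + 2 * 1     # Var(y1 + y2)
var_diff = 4 + 9 - 2 * 1    # Var(y1 - y2)
cov_sum_diff = 4 - 9        # Cov(y1 + y2, y1 - y2) = Var(y1) - Var(y2)
print(var_sum, var_diff, cov_sum_diff)
```

The diagonal of A Var(y) A' reproduces the familiar scalar variance formulas, and the off-diagonal entry gives the covariance between the two linear combinations.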


D-7c Multivariate Normal Distribution 


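The normal distribution for a random variable was discussed at some length in Math Refresher B. We need to extend the normal distribution to random vectors. We will not provide an expression for the probability distribution function, as we do not need it. It is important to know that a multivariate normal random vector is completely characterized by its mean and its variance-covariance matrix. Therefore, if $\mathbf{y}$ is an $n \times 1$ multivariate normal random vector with mean $\boldsymbol{\mu}$ and variance-covariance matrix $\boldsymbol{\Sigma}$, we write $\mathbf{y} \sim \mathrm{Normal}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$. We now state several useful properties of the multivariate normal distribution.


Properties of the Multivariate Normal Distribution. (1) If $\mathbf{y} \sim \mathrm{Normal}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, then each element of $\mathbf{y}$ is normally distributed; (2) If $\mathbf{y} \sim \mathrm{Normal}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, then $y_i$ and $y_j$, any two elements of $\mathbf{y}$, are independent if, and only if, they are uncorrelated, that is, $\sigma_{ij} = 0$; (3) If $\mathbf{y} \sim \mathrm{Normal}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, then $\mathbf{Ay} + \mathbf{b} \sim \mathrm{Normal}(\mathbf{A}\boldsymbol{\mu} + \mathbf{b}, \mathbf{A}\boldsymbol{\Sigma}\mathbf{A}')$, where $\mathbf{A}$ and $\mathbf{b}$ are nonrandom; (4) If $\mathbf{y} \sim \mathrm{Normal}(\mathbf{0}, \boldsymbol{\Sigma})$, then, for nonrandom matrices $\mathbf{A}$ and $\mathbf{B}$, $\mathbf{Ay}$ and $\mathbf{By}$ are independent if, and only if, $\mathbf{A}\boldsymbol{\Sigma}\mathbf{B}' = \mathbf{0}$. In particular, if $\boldsymbol{\Sigma} = \sigma^2\mathbf{I}_n$, then $\mathbf{AB}' = \mathbf{0}$ is necessary and sufficient for independence of $\mathbf{Ay}$ and $\mathbf{By}$; (5) If $\mathbf{y} \sim \mathrm{Normal}(\mathbf{0}, \sigma^2\mathbf{I}_n)$, $\mathbf{A}$ is a $k \times n$ nonrandom matrix, and $\mathbf{B}$ is an $n \times n$ symmetric, idempotent matrix, then $\mathbf{Ay}$ and $\mathbf{y}'\mathbf{By}$ are independent if, and only if, $\mathbf{AB} = \mathbf{0}$; and (6) If $\mathbf{y} \sim \mathrm{Normal}(\mathbf{0}, \sigma^2\mathbf{I}_n)$ and $\mathbf{A}$ and $\mathbf{B}$ are nonrandom symmetric, idempotent matrices, then $\mathbf{y}'\mathbf{Ay}$ and $\mathbf{y}'\mathbf{By}$ are independent if, and only if, $\mathbf{AB} = \mathbf{0}$.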


D-7d Chi-Square Distribution 


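In Math Refresher B, we defined a chi-square random variable as the sum of squared independent standard normal random variables. In vector notation, if $\mathbf{u} \sim \mathrm{Normal}(\mathbf{0}, \mathbf{I}_n)$, then $\mathbf{u}'\mathbf{u} \sim \chi_n^2$.


Properties of the Chi-Square Distribution. (1) If $\mathbf{u} \sim \mathrm{Normal}(\mathbf{0}, \mathbf{I}_n)$ and $\mathbf{A}$ is an $n \times n$ symmetric, idempotent matrix with $\mathrm{rank}(\mathbf{A}) = q$, then $\mathbf{u}'\mathbf{Au} \sim \chi_q^2$; (2) If $\mathbf{u} \sim \mathrm{Normal}(\mathbf{0}, \mathbf{I}_n)$ and $\mathbf{A}$ and $\mathbf{B}$ are $n \times n$ symmetric, idempotent matrices such that $\mathbf{AB} = \mathbf{0}$, then $\mathbf{u}'\mathbf{Au}$ and $\mathbf{u}'\mathbf{Bu}$ are independent, chi-square random variables; and (3) If $\mathbf{z} \sim \mathrm{Normal}(\mathbf{0}, \mathbf{C})$, where $\mathbf{C}$ is an $m \times m$ nonsingular matrix, then $\mathbf{z}'\mathbf{C}^{-1}\mathbf{z} \sim \chi_m^2$.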


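D-7e t Distribution


We also defined the $t$ distribution in Math Refresher B. Now we add an important property.


Property of the t Distribution. If $\mathbf{u} \sim \mathrm{Normal}(\mathbf{0}, \mathbf{I}_n)$, $\mathbf{c}$ is an $n \times 1$ nonrandom vector, $\mathbf{A}$ is a nonrandom $n \times n$ symmetric, idempotent matrix with rank $q$, and $\mathbf{Ac} = \mathbf{0}$, then $\{\mathbf{c}'\mathbf{u}/(\mathbf{c}'\mathbf{c})^{1/2}\}/(\mathbf{u}'\mathbf{Au}/q)^{1/2} \sim t_q$.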


D-7f F Distribution 


Recall that an F random variable is obtained by taking two independent chi-square random variables 
and finding the ratio of each, standardized by degrees of freedom. 


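Property of the F Distribution. If $\mathbf{u} \sim \mathrm{Normal}(\mathbf{0}, \mathbf{I}_n)$ and $\mathbf{A}$ and $\mathbf{B}$ are $n \times n$ nonrandom symmetric, idempotent matrices with $\mathrm{rank}(\mathbf{A}) = k_1$, $\mathrm{rank}(\mathbf{B}) = k_2$, and $\mathbf{AB} = \mathbf{0}$, then $(\mathbf{u}'\mathbf{Au}/k_1)/(\mathbf{u}'\mathbf{Bu}/k_2) \sim F_{k_1, k_2}$.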


Summary 


This Advanced Treatment contains a condensed form of the background information needed to study 
the classical linear model using matrices. Although the material here is self-contained, it is primarily 
intended as a review for readers who are familiar with matrix algebra and multivariate statistics, and it 
will be used extensively in Advanced Treatment E. 


Key Terms


Chi-Square Random Variable
Column Vector
Diagonal Matrix
Expected Value
F Random Variable
Idempotent Matrix
Identity Matrix
Inverse
Linearly Independent Vectors
Matrix
Matrix Multiplication
Multivariate Normal Distribution
Positive Definite (p.d.)
Positive Semi-Definite (p.s.d.)
Quadratic Form
Random Vector
Rank of a Matrix
Row Vector
Scalar Multiplication
Square Matrix
Symmetric Matrix
t Distribution
Trace of a Matrix
Transpose
Variance-Covariance Matrix
Zero Matrix


Problems 


1 (i) Find the product AB using


O œ =. 
oo O 


(ii) Does BA exist? 

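2 If $\mathbf{A}$ and $\mathbf{B}$ are $n \times n$ diagonal matrices, show that $\mathbf{AB} = \mathbf{BA}$.

3 Let $\mathbf{X}$ be any $n \times k$ matrix. Show that $\mathbf{X}'\mathbf{X}$ is a symmetric matrix.

4 (i) Use the properties of trace to argue that $\mathrm{tr}(\mathbf{A}'\mathbf{A}) = \mathrm{tr}(\mathbf{A}\mathbf{A}')$ for any $n \times m$ matrix $\mathbf{A}$.

(ii) For $\mathbf{A} = \begin{bmatrix} 2 & 0 & -1 \\ 0 & 3 & 0 \end{bmatrix}$, verify that $\mathrm{tr}(\mathbf{A}'\mathbf{A}) = \mathrm{tr}(\mathbf{A}\mathbf{A}')$.

5 (i) Use the definition of inverse to prove the following: if $\mathbf{A}$ and $\mathbf{B}$ are $n \times n$ nonsingular matrices, then $(\mathbf{AB})^{-1} = \mathbf{B}^{-1}\mathbf{A}^{-1}$.

(ii) If $\mathbf{A}$, $\mathbf{B}$, and $\mathbf{C}$ are all $n \times n$ nonsingular matrices, find $(\mathbf{ABC})^{-1}$ in terms of $\mathbf{A}^{-1}$, $\mathbf{B}^{-1}$, and $\mathbf{C}^{-1}$.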

6 (i) Show that if $\mathbf{A}$ is an $n \times n$ symmetric, positive semi-definite matrix, then $\mathbf{A}$ must have nonnegative diagonal elements.

(ii) Show that if $\mathbf{A}$ is an $n \times n$ symmetric, positive definite matrix, then $\mathbf{A}$ must have strictly positive diagonal elements.

(iii) Write down a $2 \times 2$ symmetric matrix with strictly positive diagonal elements that is not positive definite.


7 Let $\mathbf{A}$ be an $n \times n$ symmetric, positive definite matrix. Show that if $\mathbf{P}$ is any $n \times n$ nonsingular matrix, then $\mathbf{P}'\mathbf{AP}$ is positive definite.


8 Prove Property 5 of variances for vectors, using Property 3.


9 Let $\mathbf{a}$ be an $n \times 1$ nonrandom vector and let $\mathbf{u}$ be an $n \times 1$ random vector with $E(\mathbf{uu}') = \mathbf{I}_n$. Show that $E[\mathrm{tr}(\mathbf{auu}'\mathbf{a}')] = \sum_{i=1}^{n} a_i^2$.


10 Take as given the properties of the chi-square distribution listed in the text. Show how those properties, along with the definition of an F random variable, imply the stated property of the F distribution (concerning ratios of quadratic forms).


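11 Let $\mathbf{X}$ be an $n \times k$ matrix partitioned as

$$\mathbf{X} = (\mathbf{X}_1 \; \mathbf{X}_2),$$

where $\mathbf{X}_1$ is $n \times k_1$ and $\mathbf{X}_2$ is $n \times k_2$.
(i) Show that

$$\mathbf{X}'\mathbf{X} = \begin{bmatrix} \mathbf{X}_1'\mathbf{X}_1 & \mathbf{X}_1'\mathbf{X}_2 \\ \mathbf{X}_2'\mathbf{X}_1 & \mathbf{X}_2'\mathbf{X}_2 \end{bmatrix}.$$

What are the dimensions of each of the matrices?


(ii) Let $\mathbf{b}$ be a $k \times 1$ vector, partitioned as

$$\mathbf{b} = \begin{bmatrix} \mathbf{b}_1 \\ \mathbf{b}_2 \end{bmatrix},$$

where $\mathbf{b}_1$ is $k_1 \times 1$ and $\mathbf{b}_2$ is $k_2 \times 1$. Show that

$$(\mathbf{X}'\mathbf{X})\mathbf{b} = \begin{bmatrix} (\mathbf{X}_1'\mathbf{X}_1)\mathbf{b}_1 + (\mathbf{X}_1'\mathbf{X}_2)\mathbf{b}_2 \\ (\mathbf{X}_2'\mathbf{X}_1)\mathbf{b}_1 + (\mathbf{X}_2'\mathbf{X}_2)\mathbf{b}_2 \end{bmatrix}.$$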


12 (i) Let $\mathbf{A}$ be an $n \times n$ symmetric matrix such that $\mathbf{A}$ and $\mathbf{I}_n - \mathbf{A}$ are both positive semi-definite. Show that $0 \le a_{ii} \le 1$ for $i = 1, \ldots, n$, where $a_{ii}$ is the $i$th diagonal element of $\mathbf{A}$.
(ii) Prove that if $\mathbf{A}$ is an $n \times n$ symmetric, idempotent matrix then it must be positive semi-definite.
(iii) Prove that the only $n \times n$ symmetric, idempotent matrix that is also invertible is $\mathbf{I}_n$.


Advanced Treatment E


The Linear Regression Model in Matrix Form


This Advanced Treatment derives various results for ordinary least squares estimation of the multiple linear regression model using matrix notation and matrix algebra (see Advanced Treatment D for a summary). The material presented here is much more advanced than that in the text.


E-1 The Model and Ordinary Least 
Squares Estimation 


Throughout this Advanced Treatment, we use the $t$ subscript to index observations and $n$ to denote
the sample size. It is useful to write the multiple linear regression model with $k + 1$ parameters (an intercept and $k$ slopes) as follows:


$$y_t = \beta_0 + \beta_1 x_{t1} + \beta_2 x_{t2} + \cdots + \beta_k x_{tk} + u_t, \quad t = 1, 2, \ldots, n,$$ [E.1]
where $y_t$ is the dependent variable for observation $t$ and $x_{tj}$, $j = 1, 2, \ldots, k$, are the independent variables.
As usual, $\beta_0$ is the intercept and $\beta_1, \ldots, \beta_k$ denote the slope parameters.


For each $t$, define a $1 \times (k + 1)$ vector, $\mathbf{x}_t = (1, x_{t1}, \ldots, x_{tk})$, and let $\boldsymbol{\beta} = (\beta_0, \beta_1, \ldots, \beta_k)'$ be
the $(k + 1) \times 1$ vector of all parameters. Then, we can write (E.1) as


$$y_t = \mathbf{x}_t\boldsymbol{\beta} + u_t, \quad t = 1, 2, \ldots, n.$$ [E.2]


[Some authors prefer to define $\mathbf{x}_t$ as a column vector, in which case $\mathbf{x}_t$ is replaced with $\mathbf{x}_t'$ in (E.2).
Mathematically, it makes more sense to define it as a row vector.] We can write (E.2) in full matrix
notation by appropriately defining data vectors and matrices. Let $\mathbf{y}$ denote the $n \times 1$ vector of observations
on $y$: the $t$th element of $\mathbf{y}$ is $y_t$. Let $\mathbf{X}$ be the $n \times (k + 1)$ matrix of observations on the explanatory
variables. In other words, the $t$th row of $\mathbf{X}$ consists of the vector $\mathbf{x}_t$. Written out in detail,


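$$\underset{n \times (k+1)}{\mathbf{X}} \equiv \begin{pmatrix} \mathbf{x}_1 \\ \mathbf{x}_2 \\ \vdots \\ \mathbf{x}_n \end{pmatrix} = \begin{pmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1k} \\ 1 & x_{21} & x_{22} & \cdots & x_{2k} \\ \vdots & \vdots & \vdots & & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{nk} \end{pmatrix}.$$

Finally, let $\mathbf{u}$ be the $n \times 1$ vector of unobservable errors or disturbances. Then, we can write (E.2) for
all $n$ observations in matrix notation:


$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{u}.$$ [E.3]


Remember, because $\mathbf{X}$ is $n \times (k + 1)$ and $\boldsymbol{\beta}$ is $(k + 1) \times 1$, $\mathbf{X}\boldsymbol{\beta}$ is $n \times 1$.
Estimation of $\boldsymbol{\beta}$ proceeds by minimizing the sum of squared residuals, as in Section 3-2. Define
the sum of squared residuals function for any possible $(k + 1) \times 1$ parameter vector $\mathbf{b}$ as

$$\mathrm{SSR}(\mathbf{b}) \equiv \sum_{t=1}^{n} (y_t - \mathbf{x}_t\mathbf{b})^2.$$

The $(k + 1) \times 1$ vector of ordinary least squares estimates, $\hat{\boldsymbol{\beta}} = (\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_k)'$, minimizes
$\mathrm{SSR}(\mathbf{b})$ over all possible $(k + 1) \times 1$ vectors $\mathbf{b}$. This is a problem in multivariable calculus. For $\hat{\boldsymbol{\beta}}$ to
minimize the sum of squared residuals, it must solve the first order condition


$$\partial\,\mathrm{SSR}(\hat{\boldsymbol{\beta}})/\partial\mathbf{b} \equiv \mathbf{0}.$$ [E.4]
Using the fact that the derivative of $(y_t - \mathbf{x}_t\mathbf{b})^2$ with respect to $\mathbf{b}$ is the $1 \times (k + 1)$ vector
$-2(y_t - \mathbf{x}_t\mathbf{b})\mathbf{x}_t$, (E.4) is equivalent to


$$\sum_{t=1}^{n} \mathbf{x}_t'(y_t - \mathbf{x}_t\hat{\boldsymbol{\beta}}) \equiv \mathbf{0}.$$ [E.5]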


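(We have divided by $-2$ and taken the transpose.) We can write this first order condition as


$$\begin{aligned}
\sum_{t=1}^{n} (y_t - \hat\beta_0 - \hat\beta_1 x_{t1} - \cdots - \hat\beta_k x_{tk}) &= 0 \\
\sum_{t=1}^{n} x_{t1}(y_t - \hat\beta_0 - \hat\beta_1 x_{t1} - \cdots - \hat\beta_k x_{tk}) &= 0 \\
&\;\;\vdots \\
\sum_{t=1}^{n} x_{tk}(y_t - \hat\beta_0 - \hat\beta_1 x_{t1} - \cdots - \hat\beta_k x_{tk}) &= 0,
\end{aligned}$$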

which is identical to the first order conditions in equation (3.13). We want to write these in matrix
form to make them easier to manipulate. Using the formula for partitioned multiplication in Advanced
Treatment D, we see that (E.5) is equivalent to


$$\mathbf{X}'(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) = \mathbf{0}$$ [E.6]
or
$$(\mathbf{X}'\mathbf{X})\hat{\boldsymbol{\beta}} = \mathbf{X}'\mathbf{y}.$$ [E.7]


It can be shown that (E.7) always has at least one solution. Multiple solutions do not help us, as we are
looking for a unique set of OLS estimates given our data set. Assuming that the $(k + 1) \times (k + 1)$
symmetric matrix $\mathbf{X}'\mathbf{X}$ is nonsingular, we can premultiply both sides of (E.7) by $(\mathbf{X}'\mathbf{X})^{-1}$ to solve for
the OLS estimator $\hat{\boldsymbol{\beta}}$:


$$\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}.$$ [E.8]


This is the critical formula for matrix analysis of the multiple linear regression model. The assumption
that $\mathbf{X}'\mathbf{X}$ is invertible is equivalent to the assumption that $\mathrm{rank}(\mathbf{X}) = k + 1$, which means that
the columns of $\mathbf{X}$ must be linearly independent. This is the matrix version of MLR.3 in Chapter 3.

Before we continue, (E.8) warrants a word of warning. It is tempting to simplify the formula for
$\hat{\boldsymbol{\beta}}$ as follows:


$$\hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} = \mathbf{X}^{-1}(\mathbf{X}')^{-1}\mathbf{X}'\mathbf{y} = \mathbf{X}^{-1}\mathbf{y}.$$
The flaw in this reasoning is that $\mathbf{X}$ is usually not a square matrix, so it cannot be inverted. In other
words, we cannot write $(\mathbf{X}'\mathbf{X})^{-1} = \mathbf{X}^{-1}(\mathbf{X}')^{-1}$ unless $n = k + 1$, a case that virtually never arises
in practice.
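
As an illustration only, the following NumPy sketch computes the OLS estimates by solving the normal equations (E.7) and compares them with a general least-squares routine. The simulated data, the seed, and all variable names are assumptions made for this example and are not part of the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 200, 3                      # hypothetical sample size and number of slopes

# n x (k+1) regressor matrix whose first column is the intercept.
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
beta = np.array([1.0, 0.5, -2.0, 0.25])   # illustrative parameter values
y = X @ beta + rng.normal(size=n)

# Rank condition: X'X is invertible exactly when rank(X) = k + 1.
rank_X = np.linalg.matrix_rank(X)

# Solve (X'X) b = X'y as in (E.7); this avoids forming the inverse explicitly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check with NumPy's least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(rank_X)
print(beta_hat)
```

Solving the linear system is usually numerically preferable to computing $(\mathbf{X}'\mathbf{X})^{-1}$ and multiplying, although both correspond to (E.8) when $\mathbf{X}$ has full column rank.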
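The $n \times 1$ vectors of OLS fitted values and residuals are given by


$$\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}}, \quad \hat{\mathbf{u}} = \mathbf{y} - \hat{\mathbf{y}} = \mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}},$$ respectively.
From (E.6) and the definition of $\hat{\mathbf{u}}$, we can see that the first order condition for $\hat{\boldsymbol{\beta}}$ is the same as
$$\mathbf{X}'\hat{\mathbf{u}} = \mathbf{0}.$$ [E.9]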


Because the first column of X consists entirely of ones, (E.9) implies that the OLS residuals always 
sum to zero when an intercept is included in the equation and that the sample covariance between 
each independent variable and the OLS residuals is zero. (We discussed both of these properties in 
Chapter 3.) 
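
The two implications of (E.9) can be checked numerically. The sketch below uses simulated data (the data-generating values and names are hypothetical) and verifies that $\mathbf{X}'\hat{\mathbf{u}}$ is zero up to rounding, that the residuals sum to zero, and that the sample covariance between a regressor and the residuals is zero.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
x2 = rng.uniform(size=n)
X = np.column_stack([np.ones(n), x1, x2])          # intercept in the first column
y = 2.0 + 1.5 * x1 - 0.7 * x2 + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
u_hat = y - X @ beta_hat                            # OLS residuals

foc = X.T @ u_hat                   # (E.9): numerically a vector of zeros
resid_sum = u_hat.sum()             # first element of X'u_hat (column of ones)
cov_x1_u = np.cov(x1, u_hat)[0, 1]  # sample covariance between x1 and residuals

print(foc, resid_sum, cov_x1_u)
```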

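The sum of squared residuals can be written as


$$\mathrm{SSR} = \sum_{t=1}^{n} \hat u_t^2 = \hat{\mathbf{u}}'\hat{\mathbf{u}} = (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}})'(\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}).$$ [E.10]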


All of the algebraic properties from Chapter 3 can be derived using matrix algebra. For example, 
we can show that the total sum of squares is equal to the explained sum of squares plus the sum of 
squared residuals [see (3.27)]. The use of matrices does not provide a simpler proof than summation 
notation, so we do not provide another derivation. 

The matrix approach to multiple regression can be used as the basis for a geometrical interpreta- 
tion of regression. This involves mathematical concepts that are even more advanced than those we 
covered in Advanced Treatment D. [See Goldberger (1991) or Greene (1997).] 


E-1a The Frisch-Waugh Theorem 


In Section 3-2, we described a “partialling out” interpretation of the ordinary least squares estimates.
We can establish the partialling out interpretation very generally using matrix notation. Partition the
$n \times (k + 1)$ matrix $\mathbf{X}$ as


$$\mathbf{X} = (\mathbf{X}_1 \;\; \mathbf{X}_2),$$


where $\mathbf{X}_1$ is $n \times (k_1 + 1)$ and includes the intercept (although that is not required for the result to
hold) and $\mathbf{X}_2$ is $n \times k_2$. We still assume that $\mathbf{X}$ has rank $k + 1$, which means $\mathbf{X}_1$ has rank $k_1 + 1$ and
$\mathbf{X}_2$ has rank $k_2$.

Consider the OLS estimates $\hat{\boldsymbol{\beta}}_1$ and $\hat{\boldsymbol{\beta}}_2$ from the (long) regression


$$\mathbf{y} \text{ on } \mathbf{X}_1, \mathbf{X}_2.$$


As we know, the vector of multiple regression coefficients on $\mathbf{X}_2$, $\hat{\boldsymbol{\beta}}_2$, generally differs from $\tilde{\boldsymbol{\beta}}_2$ from the (short)
regression $\mathbf{y}$ on $\mathbf{X}_2$. One way to describe the difference is to understand that we can obtain $\hat{\boldsymbol{\beta}}_2$ from a shorter
regression, but first we must “partial out” $\mathbf{X}_1$ from $\mathbf{X}_2$. Consider the following two-step method:

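(i) Regress (each column of) $\mathbf{X}_2$ on $\mathbf{X}_1$ and obtain the matrix of residuals, say $\ddot{\mathbf{X}}_2$. We can write
$\ddot{\mathbf{X}}_2$ as
$$\ddot{\mathbf{X}}_2 = [\mathbf{I}_n - \mathbf{X}_1(\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1']\mathbf{X}_2 = (\mathbf{I}_n - \mathbf{P}_1)\mathbf{X}_2 = \mathbf{M}_1\mathbf{X}_2,$$
where $\mathbf{P}_1 = \mathbf{X}_1(\mathbf{X}_1'\mathbf{X}_1)^{-1}\mathbf{X}_1'$ and $\mathbf{M}_1 = \mathbf{I}_n - \mathbf{P}_1$ are $n \times n$ symmetric, idempotent matrices.


(ii) Regress $\mathbf{y}$ on $\ddot{\mathbf{X}}_2$ and call the $k_2 \times 1$ vector of coefficients $\ddot{\boldsymbol{\beta}}_2$.

The Frisch-Waugh (FW) theorem states that
$$\ddot{\boldsymbol{\beta}}_2 = \hat{\boldsymbol{\beta}}_2.$$
Importantly, the FW theorem generally says nothing about equality of the estimates from the
long regression, $\hat{\boldsymbol{\beta}}_2$, and those from the short regression, $\tilde{\boldsymbol{\beta}}_2$. Usually $\hat{\boldsymbol{\beta}}_2 \ne \tilde{\boldsymbol{\beta}}_2$. However, if $\mathbf{X}_1'\mathbf{X}_2 = \mathbf{0}$
then $\ddot{\mathbf{X}}_2 = \mathbf{M}_1\mathbf{X}_2 = \mathbf{X}_2$, in which case $\ddot{\boldsymbol{\beta}}_2 = \tilde{\boldsymbol{\beta}}_2$; then $\tilde{\boldsymbol{\beta}}_2 = \hat{\boldsymbol{\beta}}_2$ follows from FW. It is also worth noting
that we obtain $\hat{\boldsymbol{\beta}}_2$ if we also partial $\mathbf{X}_1$ out of $\mathbf{y}$. In other words, let $\ddot{\mathbf{y}}$ be the residuals from regressing
$\mathbf{y}$ on $\mathbf{X}_1$, so that


$$\ddot{\mathbf{y}} = \mathbf{M}_1\mathbf{y}.$$
Then $\hat{\boldsymbol{\beta}}_2$ is obtained from the regression $\ddot{\mathbf{y}}$ on $\ddot{\mathbf{X}}_2$. It is important to understand that it is not enough to
only partial out $\mathbf{X}_1$ from $\mathbf{y}$. The important step is partialling out $\mathbf{X}_1$ from $\mathbf{X}_2$. Problem 6 at the end of
this chapter asks you to derive the FW theorem and to investigate some related issues.

Another useful algebraic result is that when we regress $\ddot{\mathbf{y}}$ on $\ddot{\mathbf{X}}_2$ and save the residuals, say $\ddot{\mathbf{u}}$,
these are identical to the OLS residuals from the original (long) regression:

$$\ddot{\mathbf{y}} - \ddot{\mathbf{X}}_2\ddot{\boldsymbol{\beta}}_2 = \ddot{\mathbf{u}} = \hat{\mathbf{u}} = \mathbf{y} - \mathbf{X}_1\hat{\boldsymbol{\beta}}_1 - \mathbf{X}_2\hat{\boldsymbol{\beta}}_2,$$
where we have used the FW result $\ddot{\boldsymbol{\beta}}_2 = \hat{\boldsymbol{\beta}}_2$. We do not obtain the original OLS residuals if we regress
$\mathbf{y}$ on $\ddot{\mathbf{X}}_2$ (but we do obtain $\hat{\boldsymbol{\beta}}_2$).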

Before the advent of powerful computers, the Frisch-Waugh result was sometimes used as a com- 
putational device. Today, the result is more of theoretical interest, and it is very helpful in under- 
standing the mechanics of OLS. For example, recall that in Chapter 10 we used the FW theorem to 
establish that adding a time trend to a multiple regression is algebraically equivalent to first linearly 
detrending all of the explanatory variables before running the regression. The FW theorem also can be 
used in Chapter 14 to establish that the fixed effects estimator, which we introduced as being obtained 
from OLS on time-demeaned data, can also be obtained from the (long) dummy variable regression. 
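
The Frisch-Waugh result is easy to verify numerically. The following is a minimal sketch on simulated data; the helper `ols`, the seed, and the coefficient values are assumptions made for illustration. It checks that partialling $\mathbf{X}_1$ out of $\mathbf{X}_2$ reproduces the long-regression coefficients on $\mathbf{X}_2$, and that the long-regression residuals are recovered only when $\mathbf{X}_1$ is also partialled out of $\mathbf{y}$.

```python
import numpy as np

def ols(X, y):
    """OLS coefficients from the normal equations (X'X) b = X'y."""
    return np.linalg.solve(X.T @ X, X.T @ y)

rng = np.random.default_rng(2)
n = 150
X1 = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # includes intercept
X2 = rng.normal(size=(n, 2)) + 0.5 * X1[:, [1]]              # correlated with X1
y = X1 @ np.array([1.0, 2.0, -1.0]) + X2 @ np.array([0.5, 3.0]) + rng.normal(size=n)

# Long regression: y on X1, X2.
X = np.hstack([X1, X2])
b_long = ols(X, y)
b2_long = b_long[X1.shape[1]:]
u_long = y - X @ b_long

# Residual maker for X1: M1 = I - X1 (X1'X1)^{-1} X1'.
M1 = np.eye(n) - X1 @ np.linalg.solve(X1.T @ X1, X1.T)
X2_res = M1 @ X2        # X1 partialled out of X2
y_res = M1 @ y          # X1 partialled out of y

b2_fw = ols(X2_res, y)            # y on residualized X2
b2_fw_both = ols(X2_res, y_res)   # residualized y on residualized X2
u_fw = y_res - X2_res @ b2_fw_both

b2_short = ols(X2, y)             # short regression that ignores X1

print(b2_long, b2_fw, b2_short)
```

In this simulated design the short-regression coefficients differ from the long-regression ones because $\mathbf{X}_1'\mathbf{X}_2 \ne \mathbf{0}$, while both partialled-out regressions match the long regression.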


E-2 Finite Sample Properties of OLS 


Deriving the expected value and variance of the OLS estimator $\hat{\boldsymbol{\beta}}$ is facilitated by matrix algebra, but
we must show some care in stating the assumptions.


Assumption E.1 Linear in Parameters 


The model can be written as in (E.3), where $\mathbf{y}$ is an observed $n \times 1$ vector, $\mathbf{X}$ is an $n \times (k + 1)$
observed matrix, and $\mathbf{u}$ is an $n \times 1$ vector of unobserved errors or disturbances.


Assumption E.2 No Perfect Collinearity 


The matrix X has rank (k + 1). 


This is a careful statement of the assumption that rules out linear dependencies among the explanatory
variables. Under Assumption E.2, $\mathbf{X}'\mathbf{X}$ is nonsingular, so $\hat{\boldsymbol{\beta}}$ is unique and can be written as in (E.8).


Assumption E.3 Zero Conditional Mean


Conditional on the entire matrix $\mathbf{X}$, each error $u_t$ has zero mean: $E(u_t|\mathbf{X}) = 0$, $t = 1, 2, \ldots, n$.

In vector form, Assumption E.3 can be written as
$$E(\mathbf{u}|\mathbf{X}) = \mathbf{0}.$$ [E.11]


This assumption is implied by MLR.4 under the random sampling assumption, MLR.2. In time series
applications, Assumption E.3 imposes strict exogeneity on the explanatory variables, something discussed
at length in Chapter 10. This rules out explanatory variables whose future values are correlated
with $u_t$; in particular, it eliminates lagged dependent variables. Under Assumption E.3, we can condition
on the $x_{tj}$ when we compute the expected value of $\hat{\boldsymbol{\beta}}$.


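THEOREM E.1 UNBIASEDNESS OF OLS


Under Assumptions E.1, E.2, and E.3, the OLS estimator $\hat{\boldsymbol{\beta}}$ is unbiased for $\boldsymbol{\beta}$.


PROOF: Use Assumptions E.1 and E.2 and simple algebra to write
$$\begin{aligned}
\hat{\boldsymbol{\beta}} &= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'(\mathbf{X}\boldsymbol{\beta} + \mathbf{u}) \\
&= (\mathbf{X}'\mathbf{X})^{-1}(\mathbf{X}'\mathbf{X})\boldsymbol{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{u} = \boldsymbol{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{u},
\end{aligned}$$ [E.12]
where we use the fact that $(\mathbf{X}'\mathbf{X})^{-1}(\mathbf{X}'\mathbf{X}) = \mathbf{I}_{k+1}$. Taking the expectation conditional on $\mathbf{X}$ gives
$$\begin{aligned}
E(\hat{\boldsymbol{\beta}}|\mathbf{X}) &= \boldsymbol{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'E(\mathbf{u}|\mathbf{X}) \\
&= \boldsymbol{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{0} = \boldsymbol{\beta},
\end{aligned}$$


because $E(\mathbf{u}|\mathbf{X}) = \mathbf{0}$ under Assumption E.3. This argument clearly does not depend on the value of
$\boldsymbol{\beta}$, so we have shown that $\hat{\boldsymbol{\beta}}$ is unbiased.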
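
A small Monte Carlo experiment can illustrate Theorem E.1. In the sketch below, $\mathbf{X}$ is held fixed across replications and the errors are drawn to be non-normal and heteroskedastic but with $E(\mathbf{u}|\mathbf{X}) = \mathbf{0}$, so Assumptions E.1 to E.3 hold while E.4 does not. The design, seed, and number of replications are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 50, 5000

# Regressors are drawn once and then held fixed (we condition on X).
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta = np.array([1.0, -0.5, 2.0])          # illustrative parameter values
XtX_inv_Xt = np.linalg.solve(X.T @ X, X.T)  # (X'X)^{-1} X'

# Error scale depends on x1, so the errors are heteroskedastic; they are
# uniform (not normal) and have zero mean conditional on X.
scale = 1.0 + np.abs(X[:, 1])

draws = np.empty((reps, beta.size))
for r in range(reps):
    u = scale * rng.uniform(-1.0, 1.0, size=n)
    y = X @ beta + u
    draws[r] = XtX_inv_Xt @ y               # beta_hat for this replication

mean_beta_hat = draws.mean(axis=0)          # Monte Carlo estimate of E(beta_hat | X)
print(mean_beta_hat)
```

The average of the simulated estimates should be close to the chosen parameter vector, with the gap shrinking as the number of replications grows.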


To obtain the simplest form of the variance-covariance matrix of $\hat{\boldsymbol{\beta}}$, we impose the assumptions
of homoskedasticity and no serial correlation.


Assumption E.4 (Homoskedasticity)


Conditional on $\mathbf{X}$, the variances are constant:


$$\mathrm{Var}(u_t|\mathbf{X}) = \sigma^2, \quad t = 1, 2, \ldots, n.$$


As we discussed throughout the text, especially in Chapters 8 and 12, heteroskedasticity—which is 
failure of E.4—can never be ruled out for any of the data structures (cross section, time series, panel). 


Assumption E.5 (No Serial Correlation) 


Conditional on $\mathbf{X}$, the errors are uncorrelated for all $t \ne s$:


$$\mathrm{Cov}(u_t, u_s|\mathbf{X}) = 0, \quad \text{all } t \ne s.$$


Assumption E.5 is automatically satisfied under random sampling, which is why it does not appear
until Chapter 10. With time series applications, Assumption E.5 means that the errors or innovations
are uncorrelated across time. As we discussed in Chapters 10, 11, and 12, Assumption E.5 can be
unrealistic, particularly in models that do not include lags of $y_t$. (Including, say, $y_{t-1}$ in $\mathbf{x}_t$ is ruled out by
Assumption E.3.)

We can combine Assumptions E.4 and E.5 into a simple expression using matrix notation:


$$\mathrm{Var}(\mathbf{u}|\mathbf{X}) = \sigma^2\mathbf{I}_n.$$ [E.13]

Under this assumption, the $n \times n$ variance-covariance matrix $\mathrm{Var}(\mathbf{u}|\mathbf{X})$ depends only on a single parameter,
$\sigma^2$, and we often say that $\mathbf{u}$ has a scalar variance-covariance matrix. (The “scalar” is $\sigma^2$.)
Assumptions E.1 through E.5 comprise the Gauss-Markov assumptions. The statements of the
assumptions unify the conditions we used for cross-sectional analysis in Chapter 3 and time series
analysis in Chapter 10.
Using the concise expression in (E.13), we can derive the variance-covariance matrix of the
OLS estimator under the Gauss-Markov assumptions.


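THEOREM E.2 VARIANCE-COVARIANCE MATRIX OF THE OLS ESTIMATOR
Under Assumptions E.1 through E.5,


$$\mathrm{Var}(\hat{\boldsymbol{\beta}}|\mathbf{X}) = \sigma^2(\mathbf{X}'\mathbf{X})^{-1}.$$ [E.14]


PROOF: From the last formula in equation (E.12), we have
$$\mathrm{Var}(\hat{\boldsymbol{\beta}}|\mathbf{X}) = \mathrm{Var}[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{u}|\mathbf{X}] = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'[\mathrm{Var}(\mathbf{u}|\mathbf{X})]\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}.$$
Now, we use equation (E.13) to get
$$\begin{aligned}
\mathrm{Var}(\hat{\boldsymbol{\beta}}|\mathbf{X}) &= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'(\sigma^2\mathbf{I}_n)\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1} \\
&= \sigma^2(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1} = \sigma^2(\mathbf{X}'\mathbf{X})^{-1}.
\end{aligned}$$


Expression (E.14) means that the variance of $\hat\beta_j$ (conditional on $\mathbf{X}$) is obtained by multiplying $\sigma^2$ by
the $j$th diagonal element of $(\mathbf{X}'\mathbf{X})^{-1}$. For the slope coefficients, we gave an interpretable formula in
equation (3.51). Equation (E.14) also tells us how to obtain the covariance between any two OLS
estimates: multiply $\sigma^2$ by the appropriate off-diagonal element of $(\mathbf{X}'\mathbf{X})^{-1}$. In Chapter 4, we showed
how to avoid explicitly finding covariances for obtaining confidence intervals and hypothesis tests by
appropriately rewriting the model.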
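
To make (E.14) concrete, the sketch below computes $\sigma^2(\mathbf{X}'\mathbf{X})^{-1}$ for a fixed simulated design and compares it with the empirical covariance matrix of $\hat{\boldsymbol{\beta}}$ across many simulated error vectors that satisfy (E.13). The value of $\sigma$, the design, and the number of replications are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
n, reps, sigma = 100, 20000, 2.0

X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # held fixed
beta = np.array([1.0, 0.5, -1.0])                            # illustrative values

XtX_inv = np.linalg.inv(X.T @ X)
V = sigma**2 * XtX_inv                  # (E.14): Var(beta_hat | X)
sd_beta_hat = np.sqrt(np.diag(V))       # standard deviations of each beta_hat_j

# By (E.12), beta_hat = beta + (X'X)^{-1} X'u. Each row of U is one draw of u
# with Var(u | X) = sigma^2 I_n, so each row of `draws` is one beta_hat'.
U = rng.normal(scale=sigma, size=(reps, n))
draws = beta + U @ X @ XtX_inv
emp_cov = np.cov(draws, rowvar=False)   # simulated covariance matrix

print(V)
print(emp_cov)
```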

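The Gauss-Markov Theorem, in its full generality, can be proven.


THEOREM E.3 GAUSS-MARKOV THEOREM
Under Assumptions E.1 through E.5, $\hat{\boldsymbol{\beta}}$ is the best linear unbiased estimator.
PROOF: Any other linear estimator of $\boldsymbol{\beta}$ can be written as
$$\tilde{\boldsymbol{\beta}} = \mathbf{A}'\mathbf{y},$$ [E.15]


where $\mathbf{A}$ is an $n \times (k + 1)$ matrix. In order for $\tilde{\boldsymbol{\beta}}$ to be unbiased conditional on $\mathbf{X}$, $\mathbf{A}$ can consist of
nonrandom numbers and functions of $\mathbf{X}$. (For example, $\mathbf{A}$ cannot be a function of $\mathbf{y}$.) To see what further
restrictions on $\mathbf{A}$ are needed, write


$$\tilde{\boldsymbol{\beta}} = \mathbf{A}'(\mathbf{X}\boldsymbol{\beta} + \mathbf{u}) = (\mathbf{A}'\mathbf{X})\boldsymbol{\beta} + \mathbf{A}'\mathbf{u}.$$ [E.16]
Then,


$$\begin{aligned}
E(\tilde{\boldsymbol{\beta}}|\mathbf{X}) &= \mathbf{A}'\mathbf{X}\boldsymbol{\beta} + E(\mathbf{A}'\mathbf{u}|\mathbf{X}) \\
&= \mathbf{A}'\mathbf{X}\boldsymbol{\beta} + \mathbf{A}'E(\mathbf{u}|\mathbf{X}) \quad \text{because } \mathbf{A} \text{ is a function of } \mathbf{X} \\
&= \mathbf{A}'\mathbf{X}\boldsymbol{\beta} \quad \text{because } E(\mathbf{u}|\mathbf{X}) = \mathbf{0}.
\end{aligned}$$

For $\tilde{\boldsymbol{\beta}}$ to be an unbiased estimator of $\boldsymbol{\beta}$, it must be true that $E(\tilde{\boldsymbol{\beta}}|\mathbf{X}) = \boldsymbol{\beta}$ for all $(k + 1) \times 1$ vectors $\boldsymbol{\beta}$,
that is,


$$\mathbf{A}'\mathbf{X}\boldsymbol{\beta} = \boldsymbol{\beta} \quad \text{for all } (k + 1) \times 1 \text{ vectors } \boldsymbol{\beta}.$$ [E.17]


Because $\mathbf{A}'\mathbf{X}$ is a $(k + 1) \times (k + 1)$ matrix, (E.17) holds if, and only if, $\mathbf{A}'\mathbf{X} = \mathbf{I}_{k+1}$. Equations (E.15)
and (E.17) characterize the class of linear, unbiased estimators for $\boldsymbol{\beta}$.
Next, from (E.16), we have
$$\mathrm{Var}(\tilde{\boldsymbol{\beta}}|\mathbf{X}) = \mathbf{A}'[\mathrm{Var}(\mathbf{u}|\mathbf{X})]\mathbf{A} = \sigma^2\mathbf{A}'\mathbf{A},$$
by equation (E.13). Therefore,
$$\begin{aligned}
\mathrm{Var}(\tilde{\boldsymbol{\beta}}|\mathbf{X}) - \mathrm{Var}(\hat{\boldsymbol{\beta}}|\mathbf{X}) &= \sigma^2[\mathbf{A}'\mathbf{A} - (\mathbf{X}'\mathbf{X})^{-1}] \\
&= \sigma^2[\mathbf{A}'\mathbf{A} - \mathbf{A}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{A}] \quad \text{because } \mathbf{A}'\mathbf{X} = \mathbf{I}_{k+1} \\
&= \sigma^2\mathbf{A}'[\mathbf{I}_n - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}']\mathbf{A} \\
&= \sigma^2\mathbf{A}'\mathbf{M}\mathbf{A},
\end{aligned}$$
where $\mathbf{M} = \mathbf{I}_n - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$. Because $\mathbf{M}$ is symmetric and idempotent, $\mathbf{A}'\mathbf{M}\mathbf{A}$ is positive semi-definite
for any $n \times (k + 1)$ matrix $\mathbf{A}$. This establishes that the OLS estimator $\hat{\boldsymbol{\beta}}$ is BLUE. Why is this important?


Let $\mathbf{c}$ be any $(k + 1) \times 1$ vector and consider the linear combination $\mathbf{c}'\boldsymbol{\beta} = c_0\beta_0 + c_1\beta_1 + \cdots + c_k\beta_k$,
which is a scalar. The unbiased estimators of $\mathbf{c}'\boldsymbol{\beta}$ are $\mathbf{c}'\hat{\boldsymbol{\beta}}$ and $\mathbf{c}'\tilde{\boldsymbol{\beta}}$. But


$$\mathrm{Var}(\mathbf{c}'\tilde{\boldsymbol{\beta}}|\mathbf{X}) - \mathrm{Var}(\mathbf{c}'\hat{\boldsymbol{\beta}}|\mathbf{X}) = \mathbf{c}'[\mathrm{Var}(\tilde{\boldsymbol{\beta}}|\mathbf{X}) - \mathrm{Var}(\hat{\boldsymbol{\beta}}|\mathbf{X})]\mathbf{c} \ge 0,$$


because $[\mathrm{Var}(\tilde{\boldsymbol{\beta}}|\mathbf{X}) - \mathrm{Var}(\hat{\boldsymbol{\beta}}|\mathbf{X})]$ is p.s.d. Therefore, when it is used for estimating any linear combination
of $\boldsymbol{\beta}$, OLS yields the smallest variance. In particular, $\mathrm{Var}(\hat\beta_j|\mathbf{X}) \le \mathrm{Var}(\tilde\beta_j|\mathbf{X})$ for any other linear,
unbiased estimator of $\beta_j$.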
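
One way to see the theorem at work is to construct a competing linear unbiased estimator and inspect the variance difference. The sketch below uses a weighted estimator with arbitrary positive weights, $\mathbf{A}' = (\mathbf{X}'\mathbf{W}\mathbf{X})^{-1}\mathbf{X}'\mathbf{W}$; this particular choice of $\mathbf{A}$ is an assumption made for illustration and is not taken from the text. It satisfies $\mathbf{A}'\mathbf{X} = \mathbf{I}_{k+1}$, and the code checks that $\mathbf{A}'\mathbf{A} - (\mathbf{X}'\mathbf{X})^{-1} = \mathbf{A}'\mathbf{M}\mathbf{A}$ has no negative eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(5)
n, k = 60, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])

# Hypothetical competing estimator: weighted least squares with arbitrary
# positive weights w_t, i.e. W = diag(w).
w = rng.uniform(0.2, 5.0, size=n)
XtW = X.T * w                               # X'W, shape (k+1) x n
A_t = np.linalg.solve(XtW @ X, XtW)         # A' = (X'WX)^{-1} X'W

unbiased_check = A_t @ X                    # should equal I_{k+1}, as in (E.17)

XtX_inv = np.linalg.inv(X.T @ X)
M = np.eye(n) - X @ XtX_inv @ X.T           # symmetric, idempotent

# [Var(beta_tilde | X) - Var(beta_hat | X)] / sigma^2
diff = A_t @ A_t.T - XtX_inv
eigvals = np.linalg.eigvalsh((diff + diff.T) / 2)   # symmetrize against rounding

print(eigvals)
```

Nonnegative eigenvalues mean that every linear combination $\mathbf{c}'\tilde{\boldsymbol{\beta}}$ has conditional variance at least as large as that of $\mathbf{c}'\hat{\boldsymbol{\beta}}$ under (E.13).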


The unbiased estimator of the error variance $\sigma^2$ can be written as
$$\hat\sigma^2 = \hat{\mathbf{u}}'\hat{\mathbf{u}}/(n - k - 1),$$


which is the same as equation (3.56).
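
The divisor $n - k - 1$ is the trace of the matrix $\mathbf{M} = \mathbf{I}_n - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$ used in the Gauss-Markov proof. The following sketch, based on simulated data with hypothetical parameter values, computes $\hat\sigma^2$ and checks the properties of $\mathbf{M}$ numerically.

```python
import numpy as np

rng = np.random.default_rng(6)
n, k = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 0.5, -1.0, 2.0]) + rng.normal(scale=1.5, size=n)

# Residual-maker matrix M = I_n - X (X'X)^{-1} X'.
M = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)

u_hat = M @ y                                # OLS residuals, since u_hat = M y
sigma2_hat = u_hat @ u_hat / (n - k - 1)     # unbiased estimator of sigma^2
trace_M = np.trace(M)                        # equals n - k - 1 up to rounding

# Same residuals computed directly from the OLS estimates.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
u_direct = y - X @ beta_hat

print(sigma2_hat, trace_M)
```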


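THEOREM E.4 UNBIASEDNESS OF $\hat\sigma^2$
Under Assumptions E.1 through E.5, $\hat\sigma^2$ is unbiased for $\sigma^2$: $E(\hat\sigma^2|\mathbf{X}) = \sigma^2$ for all $\sigma^2 > 0$.
PROOF: Write $\hat{\mathbf{u}} = \mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{y} - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} = \mathbf{M}\mathbf{y} = \mathbf{M}\mathbf{u}$, where $\mathbf{M} = \mathbf{I}_n - \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'$, and the
last equality follows because $\mathbf{M}\mathbf{X} = \mathbf{0}$. Because $\mathbf{M}$ is symmetric and idempotent,
$$\hat{\mathbf{u}}'\hat{\mathbf{u}} = \mathbf{u}'\mathbf{M}'\mathbf{M}\mathbf{u} = \mathbf{u}'\mathbf{M}\mathbf{u}.$$
Because $\mathbf{u}'\mathbf{M}\mathbf{u}$ is a scalar, it equals its trace. Therefore,
$$\begin{aligned}
E(\mathbf{u}'\mathbf{M}\mathbf{u}|\mathbf{X}) &= E[\mathrm{tr}(\mathbf{u}'\mathbf{M}\mathbf{u})|\mathbf{X}] = E[\mathrm{tr}(\mathbf{M}\mathbf{u}\mathbf{u}')|\mathbf{X}] \\
&= \mathrm{tr}[E(\mathbf{M}\mathbf{u}\mathbf{u}'|\mathbf{X})] = \mathrm{tr}[\mathbf{M}E(\mathbf{u}\mathbf{u}'|\mathbf{X})] \\
&= \mathrm{tr}(\mathbf{M}\sigma^2\mathbf{I}_n) = \sigma^2\mathrm{tr}(\mathbf{M}) = \sigma^2(n - k - 1).
\end{aligned}$$


The last equality follows from $\mathrm{tr}(\mathbf{M}) = \mathrm{tr}(\mathbf{I}_n) - \mathrm{tr}[\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'] = n - \mathrm{tr}[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}] =
n - \mathrm{tr}(\mathbf{I}_{k+1}) = n - (k + 1) = n - k - 1$. Therefore,


$$E(\hat\sigma^2|\mathbf{X}) = E(\mathbf{u}'\mathbf{M}\mathbf{u}|\mathbf{X})/(n - k - 1) = \sigma^2.$$


E-3 Statistical Inference


When we add the final classical linear model assumption, $\hat{\boldsymbol{\beta}}$ has a multivariate normal distribution,
which leads to the $t$ and $F$ distributions for the standard test statistics covered in Chapter 4.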


Assumption E.6 Normality of Errors 


Conditional on $\mathbf{X}$, the $u_t$ are independent and identically distributed as Normal$(0, \sigma^2)$. Equivalently,
$\mathbf{u}$ given $\mathbf{X}$ is distributed as multivariate normal with mean zero and variance-covariance matrix
$\sigma^2\mathbf{I}_n$: $\mathbf{u} \sim \text{Normal}(\mathbf{0}, \sigma^2\mathbf{I}_n)$.


Assumption E.6 implies Assumptions E.3, E.4, and E.5, but it is much stronger because it assumes
that each $u_t$ has a Normal$(0, \sigma^2)$ distribution. As a technical point, Assumption E.6 implies that the
$u_t$ are actually independent across $t$ rather than merely uncorrelated. From a practical perspective,
this distinction is unimportant. Assumptions E.1 through E.6 are the classical linear model (CLM)
assumptions expressed in matrix terms, and they are usually viewed as the Gauss-Markov assumptions
plus normality of the errors.


THEOREM E.5 NORMALITY OF $\hat{\boldsymbol{\beta}}$


Under the classical linear model Assumptions E.1 through E.6, $\hat{\boldsymbol{\beta}}$ conditional on $\mathbf{X}$ is distributed as
multivariate normal with mean $\boldsymbol{\beta}$ and variance-covariance matrix $\sigma^2(\mathbf{X}'\mathbf{X})^{-1}$.


Theorem E.5 is the basis for statistical inference involving $\boldsymbol{\beta}$. In fact, along with the properties of the
chi-square, $t$, and $F$ distributions that we summarized in Advanced Treatment D, we can use Theorem
E.5 to establish that $t$ statistics have a $t$ distribution under Assumptions E.1 through E.6 (under the
null hypothesis) and likewise for $F$ statistics. We illustrate with a proof for the $t$ statistics.


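THEOREM E.6 DISTRIBUTION OF $t$ STATISTIC
Under Assumptions E.1 through E.6,
$$(\hat\beta_j - \beta_j)/\mathrm{se}(\hat\beta_j) \sim t_{n-k-1}, \quad j = 0, 1, \ldots, k.$$


PROOF: The proof requires several steps; the following statements are initially conditional on $\mathbf{X}$. First,
by Theorem E.5, $(\hat\beta_j - \beta_j)/\mathrm{sd}(\hat\beta_j) \sim \text{Normal}(0, 1)$, where $\mathrm{sd}(\hat\beta_j) = \sigma\sqrt{c_{jj}}$, and $c_{jj}$ is the $j$th diagonal element
of $(\mathbf{X}'\mathbf{X})^{-1}$. Next, under Assumptions E.1 through E.6, conditional on $\mathbf{X}$,


$$(n - k - 1)\hat\sigma^2/\sigma^2 \sim \chi^2_{n-k-1}.$$ [E.18]


This follows because $(n - k - 1)\hat\sigma^2/\sigma^2 = (\mathbf{u}/\sigma)'\mathbf{M}(\mathbf{u}/\sigma)$, where $\mathbf{M}$ is the $n \times n$ symmetric, idempotent
matrix defined in Theorem E.3. But $\mathbf{u}/\sigma \sim \text{Normal}(\mathbf{0}, \mathbf{I}_n)$ by Assumption E.6. It follows from
Property 1 for the chi-square distribution in Advanced Treatment D that $(\mathbf{u}/\sigma)'\mathbf{M}(\mathbf{u}/\sigma) \sim \chi^2_{n-k-1}$
(because $\mathbf{M}$ has rank $n - k - 1$).

We also need to show that $\hat{\boldsymbol{\beta}}$ and $\hat\sigma^2$ are independent. But $\hat{\boldsymbol{\beta}} = \boldsymbol{\beta} + (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{u}$, and
$\hat\sigma^2 = \mathbf{u}'\mathbf{M}\mathbf{u}/(n - k - 1)$. Now, $[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}']\mathbf{M} = \mathbf{0}$ because $\mathbf{X}'\mathbf{M} = \mathbf{0}$. It follows, from Property 5 of the
multivariate normal distribution in Advanced Treatment D, that $\hat{\boldsymbol{\beta}}$ and $\mathbf{M}\mathbf{u}$ are independent. Because $\hat\sigma^2$
is a function of $\mathbf{M}\mathbf{u}$, $\hat{\boldsymbol{\beta}}$ and $\hat\sigma^2$ are also independent.

Finally, we can write

$$(\hat\beta_j - \beta_j)/\mathrm{se}(\hat\beta_j) = [(\hat\beta_j - \beta_j)/\mathrm{sd}(\hat\beta_j)]/(\hat\sigma^2/\sigma^2)^{1/2},$$


which is the ratio of a standard normal random variable and the square root of a $\chi^2_{n-k-1}/(n - k - 1)$
random variable. We just showed that these are independent, so, by definition of a $t$ random variable,
$(\hat\beta_j - \beta_j)/\mathrm{se}(\hat\beta_j)$ has the $t_{n-k-1}$ distribution. Because this distribution does not depend on $\mathbf{X}$, it is the
unconditional distribution of $(\hat\beta_j - \beta_j)/\mathrm{se}(\hat\beta_j)$ as well.


From this theorem, we can plug in any hypothesized value for $\beta_j$ and use the $t$ statistic for testing
hypotheses, as usual.

Under Assumptions E.1 through E.6, we can compute what is known as the Cramer-Rao lower
bound for the variance-covariance matrix of unbiased estimators of $\boldsymbol{\beta}$ (again conditional on $\mathbf{X}$) [see
Greene (1997, Chapter 4)]. This can be shown to be $\sigma^2(\mathbf{X}'\mathbf{X})^{-1}$, which is exactly the variance-covariance
matrix of the OLS estimator. This implies that $\hat{\boldsymbol{\beta}}$ is the minimum variance unbiased
estimator of $\boldsymbol{\beta}$ (conditional on $\mathbf{X}$): $\mathrm{Var}(\tilde{\boldsymbol{\beta}}|\mathbf{X}) - \mathrm{Var}(\hat{\boldsymbol{\beta}}|\mathbf{X})$ is positive semi-definite for any other
unbiased estimator $\tilde{\boldsymbol{\beta}}$; we no longer have to restrict our attention to estimators linear in $\mathbf{y}$.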

It is easy to show that the OLS estimator is in fact the maximum likelihood estimator of $\boldsymbol{\beta}$ under
Assumption E.6. For each $t$, the distribution of $y_t$ given $\mathbf{X}$ is Normal$(\mathbf{x}_t\boldsymbol{\beta}, \sigma^2)$. Because the $y_t$ are
independent conditional on $\mathbf{X}$, the likelihood function for the sample is obtained from the product of
the densities:


$$\prod_{t=1}^{n} (2\pi\sigma^2)^{-1/2}\exp[-(y_t - \mathbf{x}_t\boldsymbol{\beta})^2/(2\sigma^2)],$$


where $\prod$ denotes product. Maximizing this function with respect to $\boldsymbol{\beta}$ and $\sigma^2$ is the same as maximizing
its natural logarithm:


$$\sum_{t=1}^{n} [-(1/2)\log(2\pi\sigma^2) - (y_t - \mathbf{x}_t\boldsymbol{\beta})^2/(2\sigma^2)].$$


For obtaining $\hat{\boldsymbol{\beta}}$, this is the same as minimizing $\sum_{t=1}^{n} (y_t - \mathbf{x}_t\boldsymbol{\beta})^2$ (the division by $2\sigma^2$ does not affect
the optimization), which is just the problem that OLS solves. The estimator of $\sigma^2$ that we have used,
SSR$/(n - k - 1)$, turns out not to be the MLE of $\sigma^2$; the MLE is SSR$/n$, which is a biased estimator.
Because the unbiased estimator of $\sigma^2$ results in $t$ and $F$ statistics with exact $t$ and $F$ distributions under
the null, it is always used instead of the MLE.
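
The relationship between the two variance estimators can be checked directly. In the sketch below, `normal_loglik` is a hypothetical helper that evaluates the normal log-likelihood written above; the data are simulated and the parameter values are arbitrary. The log-likelihood evaluated at $(\hat{\boldsymbol{\beta}}, \mathrm{SSR}/n)$ is compared with its value at the unbiased variance estimate and at a perturbed coefficient vector.

```python
import numpy as np

def normal_loglik(beta, sigma2, X, y):
    """Normal log-likelihood of y given X at parameters (beta, sigma2)."""
    resid = y - X @ beta
    n = y.size
    return -0.5 * n * np.log(2 * np.pi * sigma2) - resid @ resid / (2 * sigma2)

rng = np.random.default_rng(7)
n, k = 30, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)     # OLS, which is also the MLE of beta
ssr = np.sum((y - X @ beta_hat) ** 2)

sigma2_mle = ssr / n                    # MLE of sigma^2 (biased downward)
sigma2_unbiased = ssr / (n - k - 1)     # unbiased estimator used in the text

ll_mle = normal_loglik(beta_hat, sigma2_mle, X, y)
ll_unbiased = normal_loglik(beta_hat, sigma2_unbiased, X, y)
ll_perturbed = normal_loglik(beta_hat + 0.1, sigma2_mle, X, y)

print(sigma2_mle, sigma2_unbiased)
print(ll_mle, ll_unbiased, ll_perturbed)
```

The two variance estimates differ only by the factor $(n - k - 1)/n$, so the distinction matters mainly in small samples.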

That the OLS estimator is the MLE under Assumption E.6 implies an interesting robustness 
property of the MLE based on the normal distribution. The reasoning is simple. We know that the 
OLS estimator is unbiased under Assumptions E.1 to E.3; normality of the errors is used nowhere in 
the proof, and neither are Assumptions E.4 and E.5. As the next section shows, the OLS estimator is 
also consistent without normality, provided the law of large numbers holds (as is widely true). These 
statistical properties of the OLS estimator imply that the MLE based on the normal log-likelihood 
function is robust to distributional specification: the distribution can be (almost) anything and yet we 
still obtain a consistent (and, under E.1 to E.3, unbiased) estimator. As discussed in Section 17-3, a 
maximum likelihood estimator obtained without assuming the distribution is correct is often called a 
quasi-maximum likelihood estimator (QMLE). 

Generally, consistency of the MLE relies on having a correct distribution in order to conclude
that it is consistent for the parameters. We have just seen that the normal distribution is a notable
exception. There are some other distributions that share this property, including the Poisson
distribution, as discussed in Section 17-3. Wooldridge (2010, Chapter 18) discusses some other
useful examples.


E-4 Some Asymptotic Analysis


The matrix approach to the multiple regression model can also make derivations of asymptotic prop- 
erties more concise. In fact, we can give general proofs of the claims in Chapter 11. 

We begin by proving the consistency result of Theorem 11.1. Recall that these assumptions con- 
tain, as a special case, the assumptions for cross-sectional analysis under random sampling. 


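Proof of Theorem 11.1. As in Problem E.1 and using Assumption TS.1′, we write the OLS
estimator as

$$\begin{aligned}
\hat{\boldsymbol{\beta}} &= \Big(\sum_{t=1}^{n} \mathbf{x}_t'\mathbf{x}_t\Big)^{-1}\Big(\sum_{t=1}^{n} \mathbf{x}_t'y_t\Big) = \Big(\sum_{t=1}^{n} \mathbf{x}_t'\mathbf{x}_t\Big)^{-1}\Big(\sum_{t=1}^{n} \mathbf{x}_t'(\mathbf{x}_t\boldsymbol{\beta} + u_t)\Big) \\
&= \boldsymbol{\beta} + \Big(\sum_{t=1}^{n} \mathbf{x}_t'\mathbf{x}_t\Big)^{-1}\Big(\sum_{t=1}^{n} \mathbf{x}_t'u_t\Big) \\
&= \boldsymbol{\beta} + \Big(n^{-1}\sum_{t=1}^{n} \mathbf{x}_t'\mathbf{x}_t\Big)^{-1}\Big(n^{-1}\sum_{t=1}^{n} \mathbf{x}_t'u_t\Big).
\end{aligned}$$ [E.19]


Now, by the law of large numbers,


$$n^{-1}\sum_{t=1}^{n} \mathbf{x}_t'\mathbf{x}_t \overset{p}{\to} \mathbf{A} \quad \text{and} \quad n^{-1}\sum_{t=1}^{n} \mathbf{x}_t'u_t \overset{p}{\to} \mathbf{0},$$ [E.20]


where $\mathbf{A} = E(\mathbf{x}_t'\mathbf{x}_t)$ is a $(k + 1) \times (k + 1)$ nonsingular matrix under Assumption TS.2′ and we
have used the fact that $E(\mathbf{x}_t'u_t) = \mathbf{0}$ under Assumption TS.3′. Now, we must use a matrix version of
Property PLIM.1 in Math Refresher C. Namely, because $\mathbf{A}$ is nonsingular,


$$\Big(n^{-1}\sum_{t=1}^{n} \mathbf{x}_t'\mathbf{x}_t\Big)^{-1} \overset{p}{\to} \mathbf{A}^{-1}.$$ [E.21]


[Wooldridge (2010, Chapter 3) contains a discussion of these kinds of convergence results.] It now
follows from (E.19), (E.20), and (E.21) that


$$\mathrm{plim}(\hat{\boldsymbol{\beta}}) = \boldsymbol{\beta} + \mathbf{A}^{-1} \cdot \mathbf{0} = \boldsymbol{\beta}.$$


This completes the proof.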
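Next, we sketch a proof of the asymptotic normality result in Theorem 11.2.


Proof of Theorem 11.2. From equation (E.19), we can write
$$\begin{aligned}
\sqrt{n}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}) &= \Big(n^{-1}\sum_{t=1}^{n} \mathbf{x}_t'\mathbf{x}_t\Big)^{-1}\Big(n^{-1/2}\sum_{t=1}^{n} \mathbf{x}_t'u_t\Big) \\
&= \mathbf{A}^{-1}\Big(n^{-1/2}\sum_{t=1}^{n} \mathbf{x}_t'u_t\Big) + o_p(1),
\end{aligned}$$ [E.22]


where the term “$o_p(1)$” is a remainder term that converges in probability to zero. This term is equal
to $[(n^{-1}\sum_{t=1}^{n} \mathbf{x}_t'\mathbf{x}_t)^{-1} - \mathbf{A}^{-1}](n^{-1/2}\sum_{t=1}^{n} \mathbf{x}_t'u_t)$. The term in brackets converges in probability to zero
(by the same argument used in the proof of Theorem 11.1), while $(n^{-1/2}\sum_{t=1}^{n} \mathbf{x}_t'u_t)$ is bounded in probability
because it converges to a multivariate normal distribution by the central limit theorem. A well-known
result in asymptotic theory is that the product of such terms converges in probability to zero.
Further, $\sqrt{n}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})$ inherits its asymptotic distribution from $\mathbf{A}^{-1}(n^{-1/2}\sum_{t=1}^{n} \mathbf{x}_t'u_t)$. See Wooldridge
(2010, Chapter 3) for more details on the convergence results used in this proof.

By the central limit theorem, $n^{-1/2}\sum_{t=1}^{n} \mathbf{x}_t'u_t$ has an asymptotic normal distribution with mean zero
and, say, $(k + 1) \times (k + 1)$ variance-covariance matrix $\mathbf{B}$. Then, $\sqrt{n}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})$ has an asymptotic multivariate
normal distribution with mean zero and variance-covariance matrix $\mathbf{A}^{-1}\mathbf{B}\mathbf{A}^{-1}$. We now show
that, under Assumptions TS.4′ and TS.5′, $\mathbf{B} = \sigma^2\mathbf{A}$. (The general expression is useful because it underlies
heteroskedasticity-robust and serial correlation-robust standard errors for OLS, of the kind discussed
in Chapter 12.) First, under Assumption TS.5′, $\mathbf{x}_t'u_t$ and $\mathbf{x}_s'u_s$ are uncorrelated for $t \ne s$. Why? Suppose
$s < t$ for concreteness. Then, by the law of iterated expectations, $E(\mathbf{x}_t'u_tu_s\mathbf{x}_s) = E[E(u_tu_s|\mathbf{x}_t, \mathbf{x}_s)\mathbf{x}_t'\mathbf{x}_s] =
E[0 \cdot \mathbf{x}_t'\mathbf{x}_s] = \mathbf{0}$. The zero covariances imply that the variance of the sum is the sum of the variances. But
$\mathrm{Var}(\mathbf{x}_t'u_t) = E(\mathbf{x}_t'u_tu_t\mathbf{x}_t) = E(u_t^2\mathbf{x}_t'\mathbf{x}_t)$. By the law of iterated expectations, $E(u_t^2\mathbf{x}_t'\mathbf{x}_t) = E[E(u_t^2\mathbf{x}_t'\mathbf{x}_t|\mathbf{x}_t)] =
E[E(u_t^2|\mathbf{x}_t)\mathbf{x}_t'\mathbf{x}_t] = E(\sigma^2\mathbf{x}_t'\mathbf{x}_t) = \sigma^2E(\mathbf{x}_t'\mathbf{x}_t) = \sigma^2\mathbf{A}$, where we use $E(u_t^2|\mathbf{x}_t) = \sigma^2$ under Assumptions
TS.3′ and TS.4′. This shows that $\mathbf{B} = \sigma^2\mathbf{A}$, and so, under Assumptions TS.1′ to TS.5′, we have


$$\sqrt{n}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}) \overset{a}{\sim} \text{Normal}(\mathbf{0}, \sigma^2\mathbf{A}^{-1}).$$ [E.23]


This completes the proof.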

From equation (E.23), we treat $\hat{\boldsymbol{\beta}}$ as if it is approximately normally distributed with mean $\boldsymbol{\beta}$
and variance-covariance matrix $\sigma^2\mathbf{A}^{-1}/n$. The division by the sample size, $n$, is expected here: the
approximation to the variance-covariance matrix of $\hat{\boldsymbol{\beta}}$ shrinks to zero at the rate $1/n$. When we replace
$\sigma^2$ with its consistent estimator, $\hat\sigma^2 = \mathrm{SSR}/(n - k - 1)$, and replace $\mathbf{A}$ with its consistent estimator,
$n^{-1}\sum_{t=1}^{n} \mathbf{x}_t'\mathbf{x}_t = \mathbf{X}'\mathbf{X}/n$, we obtain an estimator for the asymptotic variance of $\hat{\boldsymbol{\beta}}$:
$$\widehat{\mathrm{Avar}}(\hat{\boldsymbol{\beta}}) = \hat\sigma^2(\mathbf{X}'\mathbf{X})^{-1}.$$ [E.24]


Notice how the two divisions by $n$ cancel, and the right-hand side of (E.24) is just the usual way we
estimate the variance matrix of the OLS estimator under the Gauss-Markov assumptions. To summarize,
we have shown that, under Assumptions TS.1′ to TS.5′ (which contain MLR.1 to MLR.5 as
special cases), the usual standard errors and $t$ statistics are asymptotically valid. It is perfectly legitimate
to use the usual $t$ distribution to obtain critical values and $p$-values for testing a single hypothesis.
Interestingly, in the general setup of Chapter 11, assuming normality of the errors (say, $u_t$ given
$\mathbf{x}_t, u_{t-1}, \mathbf{x}_{t-1}, \ldots, u_1, \mathbf{x}_1$ is distributed as Normal$(0, \sigma^2)$) does not necessarily help, as the $t$ statistics
would not generally have exact $t$ distributions under this kind of normality assumption. When we do not
assume strict exogeneity of the explanatory variables, exact distributional results are difficult, if not
impossible, to obtain.

If we modify the argument above, we can derive a heteroskedasticity-robust variance-covariance
matrix. The key is that we must estimate $E(u_t^2\mathbf{x}_t'\mathbf{x}_t)$ separately because this matrix no longer equals
$\sigma^2E(\mathbf{x}_t'\mathbf{x}_t)$. But, if the $\hat u_t$ are the OLS residuals, a consistent estimator is


$$(n - k - 1)^{-1}\sum_{t=1}^{n} \hat u_t^2\mathbf{x}_t'\mathbf{x}_t,$$ [E.25]
where the division by $n - k - 1$ rather than $n$ is a degrees of freedom adjustment that typically helps
the finite sample properties of the estimator. When we use the expression in equation (E.25), we
obtain


$$\widehat{\mathrm{Avar}}(\hat{\boldsymbol{\beta}}) = [n/(n - k - 1)](\mathbf{X}'\mathbf{X})^{-1}\Big(\sum_{t=1}^{n} \hat u_t^2\mathbf{x}_t'\mathbf{x}_t\Big)(\mathbf{X}'\mathbf{X})^{-1}.$$ [E.26]
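
As a rough illustration, the matrix in (E.26) can be computed in a few lines of NumPy. The data-generating process below, in which the error variance increases with $|x_{t1}|$, is an assumption chosen to make the usual and robust standard errors differ; none of the names or values come from the text.

```python
import numpy as np

rng = np.random.default_rng(8)
n, k = 500, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])

# Heteroskedastic errors: the standard deviation depends on x1.
u = (0.5 + np.abs(X[:, 1])) * rng.normal(size=n)
y = X @ np.array([1.0, 2.0, -1.0]) + u

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
u_hat = y - X @ beta_hat                         # OLS residuals

# Middle matrix: sum over t of u_hat_t^2 * x_t' x_t.
meat = (X * u_hat[:, None] ** 2).T @ X

# (E.26): robust variance matrix with the degrees-of-freedom adjustment.
avar_robust = (n / (n - k - 1)) * XtX_inv @ meat @ XtX_inv
se_robust = np.sqrt(np.diag(avar_robust))

# (E.24): usual variance matrix, valid under homoskedasticity.
sigma2_hat = u_hat @ u_hat / (n - k - 1)
se_usual = np.sqrt(np.diag(sigma2_hat * XtX_inv))

print(se_usual)
print(se_robust)
```

In this design the robust standard error on the first slope tends to exceed the usual one, because the error variance is largest where $x_{t1}$ is far from its mean.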


The square roots of the diagonal elements of the matrix in (E.26) are the same heteroskedasticity-robust
standard errors we obtained in Section 8-2 for the pure cross-sectional case. A matrix extension of the
serial correlation- (and heteroskedasticity-) robust standard errors we obtained in Section 12-5 is also
available, but the matrix that must replace (E.25) is complicated because of the serial correlation. See,
for example, Hamilton (1994, Section 10-5).


E-4a Wald Statistics for Testing Multiple Hypotheses


Similar arguments can be used to obtain the asymptotic distribution of the Wald statistic for testing
multiple hypotheses. Let $\mathbf{R}$ be a $q \times (k + 1)$ matrix, with $q \le (k + 1)$. Assume that the $q$ restrictions
on the $(k + 1) \times 1$ vector of parameters, $\boldsymbol{\beta}$, can be expressed as $H_0$: $\mathbf{R}\boldsymbol{\beta} = \mathbf{r}$, where $\mathbf{r}$ is a
$q \times 1$ vector of known constants. Under Assumptions TS.1′ to TS.5′, it can be shown that, under $H_0$,


$$[\sqrt{n}(\mathbf{R}\hat{\boldsymbol{\beta}} - \mathbf{r})]'(\sigma^2\mathbf{R}\mathbf{A}^{-1}\mathbf{R}')^{-1}[\sqrt{n}(\mathbf{R}\hat{\boldsymbol{\beta}} - \mathbf{r})] \overset{a}{\sim} \chi^2_q,$$ [E.27]


where $\mathbf{A} = E(\mathbf{x}_t'\mathbf{x}_t)$, as in the proofs of Theorems 11.1 and 11.2. The intuition behind equation
(E.27) is simple. Because $\sqrt{n}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})$ is roughly distributed as Normal$(\mathbf{0}, \sigma^2\mathbf{A}^{-1})$,
$\mathbf{R}[\sqrt{n}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})] = \sqrt{n}\mathbf{R}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})$ is approximately Normal$(\mathbf{0}, \sigma^2\mathbf{R}\mathbf{A}^{-1}\mathbf{R}')$ by Property 3
of the multivariate normal distribution in Advanced Treatment D. Under $H_0$, $\mathbf{R}\boldsymbol{\beta} = \mathbf{r}$, so
$\sqrt{n}(\mathbf{R}\hat{\boldsymbol{\beta}} - \mathbf{r}) \sim \text{Normal}(\mathbf{0}, \sigma^2\mathbf{R}\mathbf{A}^{-1}\mathbf{R}')$, approximately, under $H_0$. By Property 3 of the chi-square distribution,
$\mathbf{z}'(\sigma^2\mathbf{R}\mathbf{A}^{-1}\mathbf{R}')^{-1}\mathbf{z} \sim \chi^2_q$ if $\mathbf{z} \sim \text{Normal}(\mathbf{0}, \sigma^2\mathbf{R}\mathbf{A}^{-1}\mathbf{R}')$. To obtain the final result formally, we need
to use an asymptotic version of this property, which can be found in Wooldridge (2010, Chapter 3).
Given the result in (E.27), we obtain a computable statistic by replacing $\mathbf{A}$ and $\sigma^2$ with their
consistent estimators; doing so does not change the asymptotic distribution. The result is the so-called
Wald statistic, which, after canceling the sample sizes and doing a little algebra, can be written as


$$W = (\mathbf{R}\hat{\boldsymbol{\beta}} - \mathbf{r})'[\mathbf{R}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{R}']^{-1}(\mathbf{R}\hat{\boldsymbol{\beta}} - \mathbf{r})/\hat\sigma^2.$$ [E.28]


Under $H_0$, $W \overset{a}{\sim} \chi^2_q$, where we recall that $q$ is the number of restrictions being tested. If
$\hat\sigma^2 = \mathrm{SSR}/(n - k - 1)$, it can be shown that $W/q$ is exactly the $F$ statistic we obtained in Chapter 4
for testing multiple linear restrictions. [See, for example, Greene (1997, Chapter 7).] Therefore, under
the classical linear model assumptions TS.1 to TS.6 in Chapter 10, $W/q$ has an exact $F_{q,n-k-1}$ distribution.
Under Assumptions TS.1′ to TS.5′, we only have the asymptotic result for $W$ in (E.28). Nevertheless,
it is appropriate, and common, to treat the usual $F$ statistic as having an approximate $F_{q,n-k-1}$
distribution.

A Wald statistic that is robust to heteroskedasticity of unknown form is obtained by using the
matrix in (E.26) in place of $\hat\sigma^2(\mathbf{X}'\mathbf{X})^{-1}$, and similarly for a test statistic robust to both heteroskedasticity
and serial correlation. The robust versions of the test statistics cannot be computed via sums of
squared residuals or R-squareds from the restricted and unrestricted regressions.
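
The equality between $W/q$ and the usual $F$ statistic can be confirmed numerically. The sketch below tests two exclusion restrictions on simulated data; the restriction matrix, sample size, and parameter values are assumptions made for the example. The $F$ statistic is computed from the restricted and unrestricted sums of squared residuals, and $W$ from (E.28).

```python
import numpy as np

rng = np.random.default_rng(9)
n, k = 120, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])
y = X @ np.array([1.0, 0.5, 0.0, 0.0]) + rng.normal(size=n)

# H0: beta_2 = 0 and beta_3 = 0, written as R beta = r with q = 2.
R = np.array([[0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])
r = np.zeros(2)
q = R.shape[0]

# Unrestricted regression.
XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
u_hat = y - X @ beta_hat
ssr_ur = u_hat @ u_hat
sigma2_hat = ssr_ur / (n - k - 1)

# Wald statistic from (E.28).
gap = R @ beta_hat - r
W = gap @ np.linalg.solve(R @ XtX_inv @ R.T, gap) / sigma2_hat

# Restricted regression: drop the two excluded regressors.
X_r = X[:, :2]
b_r = np.linalg.solve(X_r.T @ X_r, X_r.T @ y)
ssr_r = np.sum((y - X_r @ b_r) ** 2)

# Usual F statistic based on sums of squared residuals.
F = ((ssr_r - ssr_ur) / q) / (ssr_ur / (n - k - 1))

print(W / q, F)
```

The sum-of-squared-residuals shortcut works here only because $\hat\sigma^2(\mathbf{X}'\mathbf{X})^{-1}$ is used as the variance matrix; with a robust variance matrix the Wald form must be computed directly.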


Summary 


This Advanced Treatment has provided a brief treatment of the linear regression model using matrix notation.
This material is included for more advanced classes that use matrix algebra, but it is not needed to read
the text. In effect, this Advanced Treatment proves some of the results that we either stated without proof,
proved only in special cases, or proved through a more cumbersome method of proof. Other topics (such
as asymptotic properties, instrumental variables estimation, and panel data models) can be given concise
treatments using matrices. Advanced texts in econometrics, including Davidson and MacKinnon (1993),
Greene (1997), Hayashi (2000), and Wooldridge (2010), can be consulted for details.


Problems


1 Let $\mathbf{x}_t$ be the $1 \times (k + 1)$ vector of explanatory variables for observation $t$. Show that the OLS estimator
$\hat{\boldsymbol{\beta}}$ can be written as

$$\hat{\boldsymbol{\beta}} = \Big(\sum_{t=1}^{n} \mathbf{x}_t'\mathbf{x}_t\Big)^{-1}\Big(\sum_{t=1}^{n} \mathbf{x}_t'y_t\Big).$$


Dividing each summation by $n$ shows that $\hat{\boldsymbol{\beta}}$ is a function of sample averages.


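2 Let $\hat{\boldsymbol{\beta}}$ be the $(k + 1) \times 1$ vector of OLS estimates.
(i) Show that for any $(k + 1) \times 1$ vector $\mathbf{b}$, we can write the sum of squared residuals as


$$\mathrm{SSR}(\mathbf{b}) = \hat{\mathbf{u}}'\hat{\mathbf{u}} + (\hat{\boldsymbol{\beta}} - \mathbf{b})'\mathbf{X}'\mathbf{X}(\hat{\boldsymbol{\beta}} - \mathbf{b}).$$


{Hint: Write $(\mathbf{y} - \mathbf{X}\mathbf{b})'(\mathbf{y} - \mathbf{X}\mathbf{b}) = [\hat{\mathbf{u}} + \mathbf{X}(\hat{\boldsymbol{\beta}} - \mathbf{b})]'[\hat{\mathbf{u}} + \mathbf{X}(\hat{\boldsymbol{\beta}} - \mathbf{b})]$ and use the fact that
$\mathbf{X}'\hat{\mathbf{u}} = \mathbf{0}$.}

(ii) Explain how the expression for $\mathrm{SSR}(\mathbf{b})$ in part (i) proves that $\hat{\boldsymbol{\beta}}$ uniquely minimizes $\mathrm{SSR}(\mathbf{b})$ over
all possible values of $\mathbf{b}$, assuming $\mathbf{X}$ has rank $k + 1$.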


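3 Let $\hat{\boldsymbol{\beta}}$ be the OLS estimate from the regression of $\mathbf{y}$ on $\mathbf{X}$. Let $\mathbf{A}$ be a $(k + 1) \times (k + 1)$ nonsingular
matrix and define $\mathbf{z}_t \equiv \mathbf{x}_t\mathbf{A}$, $t = 1, \ldots, n$. Therefore, $\mathbf{z}_t$ is $1 \times (k + 1)$ and is a nonsingular linear
combination of $\mathbf{x}_t$. Let $\mathbf{Z}$ be the $n \times (k + 1)$ matrix with rows $\mathbf{z}_t$. Let $\tilde{\boldsymbol{\beta}}$ denote the OLS estimate from
a regression of $\mathbf{y}$ on $\mathbf{Z}$.

(i) Show that $\tilde{\boldsymbol{\beta}} = \mathbf{A}^{-1}\hat{\boldsymbol{\beta}}$.

(ii) Let $\hat y_t$ be the fitted values from the original regression and let $\tilde y_t$ be the fitted values from regressing
$\mathbf{y}$ on $\mathbf{Z}$. Show that $\tilde y_t = \hat y_t$ for all $t = 1, 2, \ldots, n$. How do the residuals from the two regressions
compare?

(iii) Show that the estimated variance matrix for $\tilde{\boldsymbol{\beta}}$ is $\hat\sigma^2\mathbf{A}^{-1}(\mathbf{X}'\mathbf{X})^{-1}(\mathbf{A}^{-1})'$, where $\hat\sigma^2$ is the usual variance
estimate from regressing $\mathbf{y}$ on $\mathbf{X}$.

(iv) Let the $\hat\beta_j$ be the OLS estimates from regressing $y_t$ on $1, x_{t1}, \ldots, x_{tk}$, and let the $\tilde\beta_j$ be the OLS
estimates from the regression of $y_t$ on $1, a_1x_{t1}, \ldots, a_kx_{tk}$, where $a_j \ne 0$, $j = 1, \ldots, k$. Use the
results from part (i) to find the relationship between the $\tilde\beta_j$ and the $\hat\beta_j$.

(v) Assuming the setup of part (iv), use part (iii) to show that $\mathrm{se}(\tilde\beta_j) = \mathrm{se}(\hat\beta_j)/|a_j|$.

(vi) Assuming the setup of part (iv), show that the absolute values of the $t$ statistics for $\tilde\beta_j$ and $\hat\beta_j$ are
identical.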


4 Assume that the model $y = X\beta + u$ satisfies the Gauss-Markov assumptions, let $G$ be a 
$(k + 1) \times (k + 1)$ nonsingular, nonrandom matrix, and define $\delta = G\beta$, so that $\delta$ is also a 
$(k + 1) \times 1$ vector. Let $\hat{\beta}$ be the $(k + 1) \times 1$ vector of OLS estimators and define $\hat{\delta} = G\hat{\beta}$ as the 
OLS estimator of $\delta$. 

(i) Show that $E(\hat{\delta}|X) = \delta$. 

(ii) Find $\mathrm{Var}(\hat{\delta}|X)$ in terms of $\sigma^2$, $X$, and $G$. 

(iii) Use Problem E.3 to verify that $\hat{\delta}$ and the appropriate estimate of $\mathrm{Var}(\hat{\delta}|X)$ are obtained from the 
regression of $y$ on $XG^{-1}$. 

(iv) Now, let $c$ be a $(k + 1) \times 1$ vector with at least one nonzero entry. For concreteness, assume that 
$c_k \neq 0$. Define $\theta = c'\beta$, so that $\theta$ is a scalar. Define $\delta_j = \beta_j$, $j = 0, 1, \ldots, k - 1$ and $\delta_k = \theta$. 
Show how to define a $(k + 1) \times (k + 1)$ nonsingular matrix $G$ so that $\delta = G\beta$. (Hint: Each of 
the first $k$ rows of $G$ should contain $k$ zeros and a one. What is the last row?) 

(v) Show that for the choice of $G$ in part (iv), 

$$G^{-1} = \begin{pmatrix} 1 & 0 & 0 & \cdots & 0 \\ 0 & 1 & 0 & \cdots & 0 \\ \vdots & & & & \vdots \\ 0 & 0 & \cdots & 1 & 0 \\ -c_0/c_k & -c_1/c_k & \cdots & -c_{k-1}/c_k & 1/c_k \end{pmatrix}.$$

Use this expression for $G^{-1}$ and part (iii) to conclude that $\hat{\theta}$ and its standard error are obtained as the 
coefficient on $x_{tk}/c_k$ in the regression of 

$$y_t \text{ on } [1 - (c_0/c_k)x_{tk}],\ [x_{t1} - (c_1/c_k)x_{tk}],\ \ldots,\ [x_{t,k-1} - (c_{k-1}/c_k)x_{tk}],\ x_{tk}/c_k, \quad t = 1, \ldots, n.$$

This regression is exactly the one obtained by writing $\beta_k$ in terms of $\theta$ and $\beta_0, \beta_1, \ldots, \beta_{k-1}$, plugging 
the result into the original model, and rearranging. Therefore, we can formally justify the trick we use 
throughout the text for obtaining the standard error of a linear combination of parameters. 


5 Assume that the model $y = X\beta + u$ satisfies the Gauss-Markov assumptions and let $\hat{\beta}$ be the 
OLS estimator of $\beta$. Let $Z = G(X)$ be an $n \times (k + 1)$ matrix function of $X$ and assume that 
$Z'X$ [a $(k + 1) \times (k + 1)$ matrix] is nonsingular. Define a new estimator of $\beta$ by $\tilde{\beta} = (Z'X)^{-1}Z'y$. 

(i) Show that $E(\tilde{\beta}|X) = \beta$, so that $\tilde{\beta}$ is also unbiased conditional on $X$. 

(ii) Find $\mathrm{Var}(\tilde{\beta}|X)$. Make sure this is a symmetric, $(k + 1) \times (k + 1)$ matrix that depends on $Z$, $X$, 
and $\sigma^2$. 

(iii) Which estimator do you prefer, $\hat{\beta}$ or $\tilde{\beta}$? Explain. 


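6 Consider the setup of the Frisch-Waugh Theorem. 

(i) Using partitioned matrices, show that the first order conditions $(X'X)\hat{\beta} = X'y$ can be written as 

$$X_1'X_1\hat{\beta}_1 + X_1'X_2\hat{\beta}_2 = X_1'y$$
$$X_2'X_1\hat{\beta}_1 + X_2'X_2\hat{\beta}_2 = X_2'y.$$

(ii) Multiply the first set of equations by $X_2'X_1(X_1'X_1)^{-1}$ and subtract the result from the second set 
of equations to show that 

$$(X_2'M_1X_2)\hat{\beta}_2 = X_2'M_1y,$$

where $M_1 = I_n - X_1(X_1'X_1)^{-1}X_1'$. Conclude that, with $\ddot{X}_2 = M_1X_2$ and $\ddot{y} = M_1y$, 

$$\hat{\beta}_2 = (\ddot{X}_2'\ddot{X}_2)^{-1}\ddot{X}_2'\ddot{y}.$$

(iii) Use part (ii) to show that 

$$\hat{\beta}_2 = (\ddot{X}_2'\ddot{X}_2)^{-1}\ddot{X}_2'y.$$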
(iv) Use the fact that $M_1X_1 = 0$ to show that the residuals $\ddot{u}$ from the regression $\ddot{y}$ on $\ddot{X}_2$ are identical 
to the residuals $\hat{u}$ from the regression $y$ on $X_1$, $X_2$. [Hint: By definition and the FW theorem, 

$$\ddot{u} = \ddot{y} - \ddot{X}_2\hat{\beta}_2 = M_1(y - X_2\hat{\beta}_2) = M_1(y - X_1\hat{\beta}_1 - X_2\hat{\beta}_2).$$

Now you do the rest.] 


7 Suppose that the linear model, written in matrix notation, 

$$y = X\beta + u$$

satisfies Assumptions E.1, E.2, and E.3. Partition the model as 

$$y = X_1\beta_1 + X_2\beta_2 + u,$$

where $X_1$ is $n \times (k_1 + 1)$ and $X_2$ is $n \times k_2$. 

(i) Consider the following proposal for estimating $\beta_2$. First, regress $y$ on $X_1$ and obtain the residuals, 
say, $\ddot{y}$. Then, regress $\ddot{y}$ on $X_2$ to get $\check{\beta}_2$. Show that $\check{\beta}_2$ is generally biased and show what the bias 
is. [You should find $E(\check{\beta}_2|X)$ in terms of $\beta_2$, $X_2$, and the residual-making matrix $M_1$.] 

(ii) As a special case, write 

$$y = X_1\beta_1 + \beta_k X_k + u,$$

where $X_k$ is an $n \times 1$ vector on the variable $x_{tk}$. Show that 

$$E(\check{\beta}_k|X) = \left( \frac{SSR_k}{\sum_{t=1}^{n} x_{tk}^2} \right) \beta_k,$$

where $SSR_k$ is the sum of squared residuals from regressing $x_{tk}$ on $1, x_{t1}, x_{t2}, \ldots, x_{t,k-1}$. Why is the 
factor multiplying $\beta_k$ never greater than one? 

(iii) Suppose you know $\beta_1$. Show that the regression $y - X_1\beta_1$ on $X_2$ produces an unbiased estimator 
of $\beta_2$ (conditional on $X$). 


8 In the context of multiple regression, define the $n \times n$ matrix 

$$M = I_n - X(X'X)^{-1}X'.$$

(i) Show that $M$ is symmetric and idempotent. 

(ii) Prove that $m_{tt}$, the diagonals of the matrix $M$, satisfy $0 \le m_{tt} \le 1$ for $t = 1, 2, \ldots, n$. 

(iii) Suppose the linear model $y = X\beta + u$ satisfies the Gauss-Markov Assumptions. Let $\hat{u}$ be the 
vector of OLS residuals. Show that 

$$E(\hat{u}\hat{u}'|X) = \sigma^2 M.$$

(iv) Conclude that while the errors $\{u_t: t = 1, 2, \ldots, n\}$ are homoskedastic and uncorrelated under the 
Gauss-Markov Assumptions, the OLS residuals are heteroskedastic and correlated. 


9 Consider the population model 

$$y = x\beta + u$$
$$E(u|x) = 0,$$

where the $1 \times (k + 1)$ vector $x$ is 

$$x = (1, x_1, x_2, \ldots, x_k).$$

Let $\{(x_i, y_i): i = 1, 2, \ldots, n\}$ be a random sample. Show that Assumptions E.3 and E.5 hold. 


Chapter 2 


Question 2.1: Equation (2.6) would hold when unobserved factors such as student ability, 
motivation, age, etc., are not related to attendance. In other words, the average value of the unobservables ($u$) 
should not depend on the value of attendance ($x$). 


Question 2.2: About $11.05. To see this, from the average wages measured in 1976 and 2003 dollars, 
the CPI deflator is computed as 19.06/5.90 = 3.23. Multiplying $3.42 by 3.23 yields about $11.05. 


Question 2.3: The equation will be salaryhun = 9,631.91 + 185.01 roe, as is easily seen by 
multiplying equation (2.39) by 10. 


Question 2.4: The equation will be salaryhun = 9,631.91 + 185.01 roe, as is easily seen by 
multiplying equation (2.39) by 10. 


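Question 2.5: Equation (2.58) can be written as $\mathrm{Var}(\hat{\beta}_0) = (\sigma^2 n^{-1}) \left( \sum_{i=1}^{n} x_i^2 \right) / \left( \sum_{i=1}^{n} (x_i - \bar{x})^2 \right)$, 
where the term multiplying $\sigma^2 n^{-1}$ is greater than or equal to one, but it is equal to one if, and only if, 
$\bar{x} = 0$. In this case, the variance is as small as it can possibly be: $\mathrm{Var}(\hat{\beta}_0) = \sigma^2/n$. 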


Chapter 3 


Question 3.1: Just a few factors include age and gender distribution, size of the police force (or, 
more generally, resources devoted to crime fighting), population, and general historical factors. These 
factors certainly might be correlated with prbconv and avgsen, which means equation (3.5) would 
not hold. For example, size of the police force is possibly correlated with both prbconv and avgsen, 
as some cities put more effort into crime prevention and law enforcement. We should try to bring as 
many of these factors into the equation as possible. 


Question 3.2: About 3.06. Using the third property of OLS concerning predicted val- 
ues and residuals, plug the average values of all independent variables into the OLS regression 
line to obtain the average value of the dependent variable. So colGPA = 1.29 + .453 hsGPA + 
.0094 ACT = 1.29 + .453(3.4) + .0094(24.2) = 3.06. You can check the average of colGPA in GPA1 
to verify this to the second decimal place. 
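
The arithmetic in Question 3.2 can be reproduced in a few lines. This is only a sketch using the rounded coefficients and sample averages quoted in the answer; the variable names are illustrative and not taken from the GPA1 data file.

```python
# Estimated coefficients quoted in the answer to Question 3.2.
intercept, b_hsgpa, b_act = 1.29, 0.453, 0.0094

# Sample averages of the regressors used in the answer.
mean_hsgpa, mean_act = 3.4, 24.2

# The OLS regression line passes through the point of sample means, so
# plugging in the regressor averages returns the average of colGPA.
mean_colgpa = intercept + b_hsgpa * mean_hsgpa + b_act * mean_act
print(round(mean_colgpa, 2))
```

Because the inputs are rounded, the result should agree with the sample average of colGPA only to about the second decimal place.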


Question 3.3: No. The variable shareA is not an exact linear function of expendA and expendB, 
even though it is an exact nonlinear function: shareA = 100·[expendA/(expendA + expendB)]. 
Therefore, it is legitimate to have expendA, expendB, and shareA as explanatory variables. 


Question 3.4: As we discussed in Section 3.4, if we are interested in the effect of $x_1$ on $y$, 
correlation among the other explanatory variables ($x_2$, $x_3$, and so on) does not affect $\mathrm{Var}(\hat{\beta}_1)$. These 
variables are included as controls, and we do not have to worry about collinearity among the control 
variables. Of course, we are controlling for them primarily because we think they are correlated with 
attendance, but this is necessary to perform a ceteris paribus analysis. 


Chapter 4 


Question 4.1: Under these assumptions, the Gauss-Markov assumptions are satisfied: $u$ is 
independent of the explanatory variables, so $E(u|x_1, \ldots, x_k) = E(u)$, and $\mathrm{Var}(u|x_1, \ldots, x_k) = \mathrm{Var}(u)$. 
Further, it is easily seen that E(u) = 0. Therefore, MLR.4 and MLR.5 hold. The classical linear model 
assumptions are not satisfied because u is not normally distributed (which is a violation of MLR.6). 


Question 4.2: $H_0$: $\beta_1 = 0$, $H_1$: $\beta_1 < 0$. 


Question 4.3: Because $\hat{\beta}_1 = .56 > 0$ and we are testing against $H_1$: $\beta_1 > 0$, the one-sided 
p-value is one-half of the two-sided p-value, or .043. 


Question 4.4: $H_0$: $\beta_5 = \beta_6 = \beta_7 = \beta_8 = 0$. $k = 8$ and $q = 4$. The restricted version of the 
model is 

$$score = \beta_0 + \beta_1 classize + \beta_2 expend + \beta_3 tchcomp + \beta_4 enroll + u.$$


Question 4.5: The F statistic for testing exclusion of ACT is $[(.291 - .183)/(1 - .291)](680 - 3) = 103.13$. 
Therefore, the absolute value of the t statistic is about 10.16. The t 
statistic on ACT is negative, because $\hat{\beta}_{ACT}$ is negative, so $t_{ACT} = -10.16$. 


Question 4.6: Not by much. The F test for joint significance of droprate and gradrate is 
easily computed from the R-squareds in the table: $F = [(.361 - .353)/(1 - .361)](402/2) = 2.52$. The 
10% critical value is obtained from Table G.3a as 2.30, while the 5% critical value from Table G.3b is 
3. The p-value is about .082. Thus, droprate and gradrate are jointly significant at the 10% level, but 
not at the 5% level. In any case, controlling for these variables has a minor effect on the b/s coefficient. 
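
The F statistics in Questions 4.5 and 4.6 both come from the R-squared form of the F statistic. A minimal sketch follows, with a helper function name (`f_stat_r2`) that is an illustrative choice and not part of the text.

```python
def f_stat_r2(r2_ur, r2_r, q, df_ur):
    """R-squared form of the F statistic for q exclusion restrictions.

    r2_ur and r2_r are the unrestricted and restricted R-squareds, and
    df_ur is the degrees of freedom in the unrestricted model.
    """
    return ((r2_ur - r2_r) / (1.0 - r2_ur)) * (df_ur / q)

# Question 4.5: one restriction (drop ACT), df = 680 - 3.
f_act = f_stat_r2(0.291, 0.183, q=1, df_ur=680 - 3)
t_abs = f_act ** 0.5  # with a single restriction, |t| is the square root of F

# Question 4.6: two restrictions (droprate and gradrate), df = 402.
f_joint = f_stat_r2(0.361, 0.353, q=2, df_ur=402)

print(round(f_act, 2), round(t_abs, 2), round(f_joint, 2))
```

The sign of the t statistic is not recoverable from F; as the answer to Question 4.5 notes, it has to be read from the sign of the coefficient.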


Chapter 5 


Question 5.1: Assuming that $\beta_2 > 0$ (score depends positively on priGPA) and Cov(skipped, 
priGPA) < 0 (skipped and priGPA are negatively correlated), it follows that $\beta_2\delta_1 < 0$, which means 
that plim $\tilde{\beta}_1 < \beta_1$. Because $\beta_1$ is thought to be negative (or at least nonpositive), a simple regression is likely 
to overestimate the importance of skipping classes. 


Question 5.2: $\hat{\beta}_j \pm 1.96\,se(\hat{\beta}_j)$ is the asymptotic 95% confidence interval. Or, we can replace 
1.96 with 2. 


Chapter 6 


Question 6.1: Because fincdol = 1,000·faminc, the coefficient on fincdol will be the coefficient 
on faminc divided by 1,000, or .0927/1,000 = .0000927. The standard error also drops by a factor of 
1,000, so the t statistic does not change, nor do any of the other OLS statistics. For readability, it is 
better to measure family income in thousands of dollars. 
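
The scaling result in Question 6.1 can be checked numerically. The sketch below uses a small made-up data set (not the data behind the .0927 estimate) and a hand-written simple regression routine, `simple_ols`, which is an illustrative helper.

```python
def simple_ols(x, y):
    """Return slope, its standard error, and the t statistic from y on (1, x)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sst_x = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sst_x
    intercept = ybar - slope * xbar
    ssr = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se = (ssr / (n - 2) / sst_x) ** 0.5
    return slope, se, slope / se

# Made-up data: family income in thousands of dollars and some outcome y.
faminc = [10.0, 20.0, 35.0, 50.0, 65.0, 80.0]
y = [100.0, 104.0, 103.0, 109.0, 108.0, 113.0]
fincdol = [1000.0 * v for v in faminc]  # the same income measured in dollars

b_thou, se_thou, t_thou = simple_ols(faminc, y)
b_dol, se_dol, t_dol = simple_ols(fincdol, y)

# Coefficient and standard error shrink by 1,000; the t statistic is unchanged.
print(b_thou / b_dol, se_thou / se_dol, t_thou, t_dol)
```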
Question 6.2: Use a more general form of the equation 

$$\log(y) = \beta_0 + \beta_1\log(x_1) + \beta_2 x_2 + \cdots,$$

where $x_2$ is a proportion rather than a percentage. Then, ceteris paribus, $\Delta\log(y) = \beta_2\Delta x_2$, 
$100\cdot\Delta\log(y) = \beta_2(100\cdot\Delta x_2)$, or $\%\Delta y \approx \beta_2(100\cdot\Delta x_2)$. Now, because $\Delta x_2$ is the change in the 
proportion, $100\cdot\Delta x_2$ is a percentage point change. In particular, if $\Delta x_2 = .01$, then $100\cdot\Delta x_2 = 1$, which 
corresponds to a one percentage point change. But then $\beta_2$ is the percentage change in $y$ when 
$100\cdot\Delta x_2 = 1$. 


Question 6.3: The new model would be $stndfnl = \beta_0 + \beta_1 atndrte + \beta_2 priGPA + \beta_3 ACT + 
\beta_4 priGPA^2 + \beta_5 ACT^2 + \beta_6 priGPA\cdot atndrte + \beta_7 ACT\cdot atndrte + u$. Therefore, the partial effect of 
atndrte on stndfnl is $\beta_1 + \beta_6 priGPA + \beta_7 ACT$. This is what we multiply by $\Delta atndrte$ to obtain the 
ceteris paribus change in stndfnl. 


Question 6.4: From equation (6.21), $\bar{R}^2 = 1 - \hat{\sigma}^2/[SST/(n - 1)]$. For a given sample and a 
given dependent variable, $SST/(n - 1)$ is fixed. When we use different sets of explanatory variables, 
only $\hat{\sigma}^2$ changes. As $\hat{\sigma}^2$ decreases, $\bar{R}^2$ increases. If we make $\hat{\sigma}$, and therefore $\hat{\sigma}^2$, as small as possible, 
we are making $\bar{R}^2$ as large as possible. 


Question 6.5: For a chosen sport (say, players in the National Basketball Association, or NBA), 
we can collect numerous statistics describing each player’s on-court performance. Just a handful of 
variables include games played, minutes played per game, points scored per game, rebounds per 
game, assists per game, and measures of defensive efficiency. One has latitude in the actual 
collection of variables that indicate the productivity of NBA basketball players. Using data on salary and 
performance, we can run a regression of salary (or, because of the benefits of using the logarithm 
when a variable is a strictly positive monetary value, probably lsalary = log(salary)) on the 
measures of performance. The fitted values from the regression give us the predicted log salary based on 
performance. Then, we can compute the residuals to see which players have actual lsalary above the 
predicted value (the “overpaid” players) and which have negative residuals (the “underpaid” players). 
Remember, the residuals always add up to zero. Therefore, by construction, we must find some play- 
ers are “overpaid” and some are “underpaid.” 


Chapter 7 


Question 7.1: No, because it would not be clear when party is one and when it is zero. A better 
name would be something like Dem, which is one for Democratic candidates and zero for Republi- 
cans. Or, Rep, which is one for Republicans and zero for Democrats. 


Question 7.2: With outfield as the base group, we would include the dummy variables frstbase, 
scndbase, thrdbase, shrtstop, and catcher. 


Question 7.3: The null in this case is $H_0$: $\delta_1 = \delta_2 = \delta_3 = \delta_4 = 0$, so that there are four 
restrictions. As usual, we would use an F test (where $q = 4$ and $k$ depends on the number of other 
explanatory variables). 


Question 7.4: Because tenure appears as a quadratic, we should allow separate quadratics for 
men and women. That is, we would add the explanatory variables female·tenure and female·tenure². 


Question 7.5: We plug pcnv = 0, avgsen = 0, tottime = 0, ptime86 = 0, qemp86 = 4, 
black = 1, and hispan = 0 into equation (7.31): arr86 = .380 − .038(4) + .170 = .398, or almost 
.4. It is hard to know whether this is “reasonable.” For someone with no prior convictions who was 
employed throughout the year, this estimate might seem high, but remember that the population con- 
sists of men who were already arrested at least once prior to 1986. 
Chapter 8 


Question 8.1: The statement is false. For example, in equation (8.7), the usual standard error for 
black is .147, while the heteroskedasticity-robust standard error is .118. 


Question 8.2: The F test would be obtained by regressing $\hat{u}^2$ on marrmale, marrfem, and 
singfem (singmale is the base group). With n = 526 and three independent variables in this regression, the 
df are 3 and 522. 


Question 8.3: Certainly the outcome of the statistical test suggests some cause for concern. 
A t statistic of 2.96 is very significant, and it implies that there is heteroskedasticity in the wealth 
equation. As a practical matter, we know that the WLS standard error, .063, is substantially below 
the heteroskedasticity-robust standard error for OLS, .104, and so the heteroskedasticity seems to be 
practically important. (Plus, the nonrobust OLS standard error is .061, which is too optimistic. There- 
fore, even if we simply adjust the OLS standard error for heteroskedasticity of unknown form, there 
are nontrivial implications.) 


Question 8.4: The 1% critical value in the F distribution with $(2, \infty)$ df is 4.61. An F statistic of 
11.15 is well above the 1% critical value, and so we strongly reject the null hypothesis that the 
transformed errors, $u_i/\sqrt{h_i}$, are homoskedastic. (In fact, the p-value is less than .00002, which is obtained 
from the $F_{2,804}$ distribution.) This means that our model for $\mathrm{Var}(u|x)$ is inadequate for fully 
eliminating the heteroskedasticity in $u$. 


Chapter 9 


Question 9.1: These are binary variables, and squaring them has no effect: black² = black, and 
hispan² = hispan. 


Question 9.2: When educ·IQ is in the equation, the coefficient on educ, say, $\beta_1$, measures the 
effect of educ on log(wage) when IQ = 0. (The partial effect of education is $\beta_1 + \beta_9 IQ$.) There is no 
one in the population of interest with an IQ close to zero. At the average population IQ, which is 100, 
the estimated return to education from column (3) is .018 + .00034(100) = .052, which is almost 
what we obtain as the coefficient on educ in column (2). 


Question 9.3: No. If educ* is an integer (which means someone has no education past the 
previous grade completed), the measurement error is zero. If educ* is not an integer, educ < educ*, so 
the measurement error is negative. At a minimum, $e_1$ cannot have zero mean, and $e_1$ and educ* are 
probably correlated. 


Question 9.4: An incumbent’s decision not to run may be systematically related to how he or 
she expects to do in the election. Therefore, we may only have a sample of incumbents who are 
stronger, on average, than all possible incumbents who could run. This results in a sample selection 
problem if the population of interest includes all incumbents. If we are only interested in the effects of 
campaign expenditures on election outcomes for incumbents who seek reelection, there is no sample 
selection problem. 


Chapter 10 


Question 10.1: The impact propensity is .48, while the long-run propensity is .48 − .15 + 
.32 = .65. 


Question 10.2: The explanatory variables are $x_{t1} = z_t$ and $x_{t2} = z_{t-1}$. The absence of perfect 
collinearity means that these cannot be constant, and there cannot be an exact linear relationship 
between them in the sample. This rules out the possibility that all the $z_1, \ldots, z_n$ take on the same 
value or that the $z_0, z_1, \ldots, z_{n-1}$ take on the same value. But it eliminates other patterns as well. For 
example, if $z_t = a + bt$ for constants $a$ and $b$, then $z_{t-1} = a + b(t - 1) = (a + bt) - b = z_t - b$, 
which is a perfect linear function of $z_t$. 
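
The linear-trend example in Question 10.2 is easy to verify numerically. The sketch below picks arbitrary values for a, b, and n (these are assumptions made only for illustration) and confirms that the lag differs from the current value by the constant b, so the two regressors are perfectly correlated in the sample.

```python
a, b, n = 2.0, 0.5, 10                 # hypothetical trend parameters and sample size
z = [a + b * t for t in range(n + 1)]  # z_0, z_1, ..., z_n
z_cur, z_lag = z[1:], z[:-1]           # (z_1, ..., z_n) and (z_0, ..., z_{n-1})

# Each lag equals the current value minus b: an exact linear relationship.
gaps = [cur - lag for cur, lag in zip(z_cur, z_lag)]

def corr(u, v):
    """Sample correlation coefficient of two equal-length lists."""
    m = len(u)
    ubar, vbar = sum(u) / m, sum(v) / m
    suv = sum((ui - ubar) * (vi - vbar) for ui, vi in zip(u, v))
    suu = sum((ui - ubar) ** 2 for ui in u)
    svv = sum((vi - vbar) ** 2 for vi in v)
    return suv / (suu * svv) ** 0.5

print(set(gaps), corr(z_cur, z_lag))
```

With an intercept in the regression, a sample correlation of one between the two regressors means the no perfect collinearity assumption fails.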


Question 10.3: If $\{z_t\}$ is slowly moving over time, as is the case for the levels or logs of many 
economic time series, then $z_t$ and $z_{t-1}$ can be highly correlated. For example, the correlation between 
$unem_t$ and $unem_{t-1}$ in PHILLIPS is .75. 


Question 10.4: No, because a linear time trend with $\alpha_1 < 0$ becomes more and more negative as 
$t$ gets large. Since gfr cannot be negative, a linear time trend with a negative trend coefficient cannot 
represent gfr in all future time periods. 


Question 10.5: The intercept for March is $\beta_0 + \delta_2$. Seasonal dummy variables are strictly 
exogenous because they follow a deterministic pattern. For example, the months do not change based upon 
whether either the explanatory variables or the dependent variables change. 


Chapter 11 


Question 11.1: (i) No, because $E(y_t) = \delta_0 + \delta_1 t$ depends on $t$. (ii) Yes, because $y_t - E(y_t) = e_t$ 
is an i.i.d. sequence. 


Question 11.2: We plug $inf_t^e = (1/2)inf_{t-1} + (1/2)inf_{t-2}$ into $inf_t - inf_t^e = \beta_1(unem_t - \mu_0) + e_t$ 
and rearrange: $inf_t - (1/2)(inf_{t-1} + inf_{t-2}) = \beta_0 + \beta_1 unem_t + e_t$, where $\beta_0 = -\beta_1\mu_0$, as before. 
Therefore, we would regress $y_t$ on $unem_t$, where $y_t = inf_t - (1/2)(inf_{t-1} + inf_{t-2})$. Note that we lose 
the first two observations in constructing $y_t$. 
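
Constructing the dependent variable in Question 11.2 can be sketched as follows. The inflation figures are made up for illustration, and the function name is hypothetical.

```python
def expectations_adjusted(inf):
    """Return y_t = inf_t - (1/2)(inf_{t-1} + inf_{t-2}).

    The first two periods have no defined value because two lags are needed,
    so the returned list is two elements shorter than the input.
    """
    return [inf[t] - 0.5 * (inf[t - 1] + inf[t - 2]) for t in range(2, len(inf))]

inf = [3.0, 4.0, 2.0, 5.0, 6.0]  # hypothetical annual inflation rates
y = expectations_adjusted(inf)
print(y, len(inf) - len(y))
```

The regression of this series on unem must then use unem from the third period onward so that the two series line up.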


Question 11.3: No, because $u_t$ and $u_{t-1}$ are correlated. In particular, $\mathrm{Cov}(u_t, u_{t-1}) = 
E[(e_t + \alpha_1 e_{t-1})(e_{t-1} + \alpha_1 e_{t-2})] = \alpha_1 E(e_{t-1}^2) = \alpha_1\sigma_e^2 \neq 0$ if $\alpha_1 \neq 0$. If the errors are serially 
correlated, the model cannot be dynamically complete. 


Chapter 12 


Question 12.1: We use equation (12.4). Now, only adjacent terms are correlated. In particular, 
the covariance between $x_t u_t$ and $x_{t+1}u_{t+1}$ is $x_t x_{t+1}\mathrm{Cov}(u_t, u_{t+1}) = x_t x_{t+1}\alpha\sigma_e^2$. Therefore, the formula is 

$$\begin{aligned}
\mathrm{Var}(\hat{\beta}_1) &= SST_x^{-2}\left( \sum_{t=1}^{n} x_t^2\,\mathrm{Var}(u_t) + 2\sum_{t=1}^{n-1} x_t x_{t+1} E(u_t u_{t+1}) \right) \\
&= \sigma^2/SST_x + (2/SST_x^2) \sum_{t=1}^{n-1} \alpha\sigma_e^2 x_t x_{t+1} \\
&= \sigma^2/SST_x + \alpha\sigma_e^2 (2/SST_x^2) \sum_{t=1}^{n-1} x_t x_{t+1},
\end{aligned}$$

where $\sigma^2 = \mathrm{Var}(u_t) = \sigma_e^2 + \alpha^2\sigma_e^2 = \sigma_e^2(1 + \alpha^2)$. Unless $x_t$ and $x_{t+1}$ are uncorrelated in the sample, 
the second term is nonzero whenever $\alpha \neq 0$. Notice that if $x_t$ and $x_{t+1}$ are positively correlated and 
$\alpha < 0$, the true variance is actually smaller than the usual variance. When the equation is in levels (as 
opposed to being differenced), the typical case is $\alpha > 0$, with positive correlation between $x_t$ and $x_{t+1}$. 
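
To see how the extra term in Question 12.1 moves the variance, the formula can be evaluated directly. The sketch below assumes, as a simplification, that the regressor has already been demeaned so that SST_x is just the sum of squares; the data and parameter values are made up, and `slope_variances` is a hypothetical helper.

```python
def slope_variances(x, alpha, sigma_e2):
    """Usual and true variance of the OLS slope under MA(1) errors.

    Errors are u_t = e_t + alpha * e_{t-1} with Var(e_t) = sigma_e2.
    x is assumed to be demeaned, so SST_x is the sum of squared x values.
    """
    sst_x = sum(v * v for v in x)
    sigma2 = sigma_e2 * (1.0 + alpha ** 2)            # Var(u_t)
    usual = sigma2 / sst_x                            # formula that ignores serial correlation
    cross = sum(x[t] * x[t + 1] for t in range(len(x) - 1))
    true = usual + alpha * sigma_e2 * (2.0 / sst_x ** 2) * cross
    return usual, true

x = [-2.0, -1.0, 0.0, 1.0, 2.0]  # demeaned and positively autocorrelated
usual, true_pos = slope_variances(x, alpha=0.5, sigma_e2=1.0)
_, true_neg = slope_variances(x, alpha=-0.5, sigma_e2=1.0)
print(usual, true_pos, true_neg)
```

With positively correlated adjacent x values, a positive alpha makes the usual formula understate the true variance, and a negative alpha makes it overstate it.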


Question 12.2: $\hat{\rho} \pm 1.96\,se(\hat{\rho})$, where $se(\hat{\rho})$ is the standard error reported in the regression. Or, 
we could use the heteroskedasticity-robust standard error. Showing that this is asymptotically valid is 
complicated because the OLS residuals depend on $\hat{\beta}_j$, but it can be done. 


Question 12.3: The model we have in mind is $u_t = \rho_1 u_{t-1} + \rho_4 u_{t-4} + e_t$, and we want to test 
$H_0$: $\rho_1 = 0$, $\rho_4 = 0$ against the alternative that $H_0$ is false. We would run the regression of $\hat{u}_t$ on $\hat{u}_{t-1}$ 
and $\hat{u}_{t-4}$ to obtain the usual F statistic for joint significance of the two lags. (We are testing two 
restrictions.) 


Question 12.4: We would probably estimate the equation using first differences, as $\hat{\rho} = .92$ is close 
enough to 1 to raise questions about the levels regression. See Chapter 18 for more discussion. 


Question 12.5: Because there is only one explanatory variable, the White test can be computed 
by regressing $\hat{u}_t^2$ on $return_{t-1}$ and $return_{t-1}^2$ (with an intercept, as always) and computing the F test for 
joint significance of $return_{t-1}$ and $return_{t-1}^2$. If these are jointly significant at a small enough 
significance level, we reject the null of homoskedasticity. 


Chapter 13 


Question 13.1: Yes, assuming that we have controlled for all relevant factors. The coefficient 
on black is 1.076, and, with a standard error of .174, it is not statistically different from 1. The 95% 
confidence interval is from about .735 to 1.417. 


Question 13.2: The coefficient on highearn shows that, in the absence of any change in the 
earnings cap, high earners spend much more time on workers’ compensation, on the order of 29.2% 
more on average [because exp(.256) − 1 ≈ .292]. 


Question 13.3: $E(v_{i1}) = E(a_i + u_{i1}) = E(a_i) + E(u_{i1}) = 0$. Similarly, $E(v_{i2}) = 0$. 
Therefore, the covariance between $v_{i1}$ and $v_{i2}$ is $E(v_{i1}v_{i2}) = E[(a_i + u_{i1})(a_i + u_{i2})] = 
E(a_i^2) + E(a_i u_{i1}) + E(a_i u_{i2}) + E(u_{i1}u_{i2}) = E(a_i^2)$, because all of the covariance terms are zero 
by assumption. But $E(a_i^2) = \mathrm{Var}(a_i)$, because $E(a_i) = 0$. This causes positive serial correlation 
across time in the errors within each $i$, which biases the usual OLS standard errors in a pooled OLS 
regression. 


Question 13.4: Because $\Delta admn = admn_{90} - admn_{85}$ is the difference in binary indicators, it 
can be −1 if, and only if, $admn_{90} = 0$ and $admn_{85} = 1$. In other words, Washington state had an 
administrative per se law in 1985 but it was repealed by 1990. 


Question 13.5: No, just as it does not cause bias and inconsistency in a time series regression 
with strictly exogenous explanatory variables. There are two reasons it is a concern. First, serial cor- 
relation in the errors in any equation generally biases the usual OLS standard errors and test statistics. 
Second, it means that pooled OLS is not as efficient as estimators that account for the serial correla- 
tion (as in Chapter 12). 


Chapter 14 


Question 14.1: Whether we use first differencing or the within transformation, we will have 
trouble estimating the coefficient on $kids_{it}$. For example, using the within transformation, if $kids_{it}$ does 
not vary for family $i$, then $\ddot{kids}_{it} = kids_{it} - \overline{kids}_i = 0$ for $t = 1, 2, 3$. As long as some families have 
variation in $kids_{it}$, then we can compute the fixed effects estimator, but the kids coefficient could be 
very imprecisely estimated. This is a form of multicollinearity in fixed effects estimation (or 
first-differencing estimation). 
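
The point of Question 14.1 is visible once the within transformation is applied to families with and without changes in the number of children. The two families below are invented for illustration, and `within_transform` is a hypothetical helper.

```python
def within_transform(values):
    """Subtract the unit's own time average from each time period's value."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

kids_family_a = [2, 2, 2]  # number of kids never changes over the three years
kids_family_b = [1, 2, 2]  # a child is born after the first year

demeaned_a = within_transform(kids_family_a)  # all zeros: no usable variation
demeaned_b = within_transform(kids_family_b)  # nonzero: contributes to the estimate
print(demeaned_a, demeaned_b)
```

Only families like the second one help identify the kids coefficient, which is why it can be imprecisely estimated when few families change size.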


Question 14.2: If a firm did not receive a grant in the first year, it may or may not receive a 
grant in the second year. But if a firm did receive a grant in the first year, it could not get a grant in 
the second year. That is, if $grant_{-1} = 1$, then $grant = 0$. This induces a negative correlation between 
$grant$ and $grant_{-1}$. We can verify this by computing a regression of $grant$ on $grant_{-1}$, using the data in 
JTRAIN for 1989. Using all firms in the sample, we get 

$$\widehat{grant} = \underset{(.035)}{.248} - \underset{(.072)}{.248}\, grant_{-1}, \qquad n = 157,\; R^2 = .070.$$

The coefficient on $grant_{-1}$ must be the negative of the intercept because $\widehat{grant} = 0$ when $grant_{-1} = 1$. 
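
The claim that the slope must equal the negative of the intercept can be checked on any data with this structure. The eight firms below are made up (they are not the JTRAIN observations); the only feature that matters is that every firm with a grant last year has no grant this year.

```python
# Hypothetical firms: any firm with grant_lag = 1 must have grant = 0.
grant_lag = [1, 1, 0, 0, 0, 0, 0, 0]
grant     = [0, 0, 1, 0, 0, 1, 0, 0]

n = len(grant)
xbar, ybar = sum(grant_lag) / n, sum(grant) / n
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(grant_lag, grant))
sxx = sum((x - xbar) ** 2 for x in grant_lag)

slope = sxy / sxx
intercept = ybar - slope * xbar

# With a binary regressor, the intercept is the mean of grant when grant_lag = 0,
# and intercept + slope is the mean of grant when grant_lag = 1, which is zero here.
print(round(intercept, 3), round(slope, 3))
```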


Question 14.3: It suggests that the unobserved effect $a_i$ is positively correlated with $union_{it}$. 
Remember, pooled OLS leaves $a_i$ in the error term, while fixed effects removes $a_i$. By definition, $a_i$ 
has a positive effect on log(wage). By the standard omitted variables analysis (see Chapter 3), OLS 
has an upward bias when the explanatory variable (union) is positively correlated with the omitted 
variable ($a_i$). Thus, belonging to a union appears to be positively related to time-constant, unobserved 
factors that affect wage. 


Question 14.4: Not if all sisters within a family have the same mother and father. Then, 
because the parents’ race variables would not change by sister, they would be differenced away in 
equation (14.13). 


Chapter 15 


Question 15.1: Probably not. In the simple equation (15.18), years of education is part of the 
error term. If some men who were assigned low draft lottery numbers obtained additional schooling, 
then lottery number and education are negatively correlated, which violates the first requirement for 
an instrumental variable in equation (15.4). 


Question 15.2: (i) For equation (15.27), we require that high school peer group effects carry 
over to college. Namely, for a given SAT score, a student who went to a high school where smoking 
marijuana was more popular would smoke more marijuana in college. Even if the identification con- 
dition equation (15.27) holds, the link might be weak. 

(ii) We have to assume that the percentage of students using marijuana at a student’s high school is 
not correlated with unobserved factors that affect college grade point average. Although we are some- 
what controlling for high school quality by including SAT in the equation, this might not be enough. 
Perhaps high schools that did a better job of preparing students for college also had fewer students 
smoking marijuana. Or marijuana usage could be correlated with average income levels. These are, of 
course, empirical questions that we may or may not be able to answer. 


Question 15.3: Although prevalence of the NRA and subscribers to gun magazines are probably 
correlated with the presence of gun control legislation, it is not obvious that they are uncorrelated 
with unobserved factors that affect the violent crime rate. In fact, we might argue that a population 
interested in guns is a reflection of high crime rates, and controlling for economic and demographic 
variables is not sufficient to capture this. It would be hard to argue persuasively that these are truly 
exogenous in the violent crime equation. 


Question 15.4: As usual, there are two requirements. First, it should be the case that growth 
in government spending is systematically related to the party of the president, after netting out the 
investment rate and growth in the labor force. In other words, the instrument must be partially 
correlated with the endogenous explanatory variable. While we might think that government 
spending grows more slowly under Republican presidents, this certainly has not always been true in 
the United States and would have to be tested using the t statistic on $REP_{t-1}$ in the reduced form 
$gGOV_t = \pi_0 + \pi_1 REP_{t-1} + \pi_2 INVRAT_t + \pi_3 gLAB_t + v_t$. Second, we must assume that the party of the 
president has no separate effect on gGDP. This would be violated if, for example, monetary policy 
differs systematically by presidential party and has a separate effect on GDP growth. 


Chapter 16 


Question 16.1: Probably not. It is because firms choose price and advertising expenditures 
jointly that we are not interested in the experiment where, say, advertising changes exogenously and 
we want to know the effect on price. Instead, we would model price and advertising each as a function 
of demand and cost variables. This is what falls out of the economic theory. 


Question 16.2: We must assume two things. First, money supply growth should appear in 
equation (16.22), so that it is partially correlated with inf. Second, we must assume that money sup- 
ply growth does not appear in equation (16.23). If we think we must include money supply growth in 
equation (16.23), then we are still short an instrument for inf. Of course, the assumption that money 
supply growth is exogenous can also be questioned. 


Question 16.3: Use the Hausman test from Chapter 15. In particular, let $\hat{v}_2$ be the OLS residuals 
from the reduced form regression of open on log(pcinc) and log(land). Then, use an OLS regression 
of inf on open, log(pcinc), and $\hat{v}_2$ and compute the t statistic for significance of $\hat{v}_2$. If $\hat{v}_2$ is significant, 
the 2SLS and OLS estimates are statistically different. 


Question 16.4: The demand equation looks like 


$$\log(fish_t) = \beta_0 + \beta_1\log(prcfish_t) + \beta_2\log(inc_t) + \beta_3\log(prcchick_t) + \beta_4\log(prcbeef_t) + u_{t1},$$


where logarithms are used so that all elasticities are constant. By assumption, the demand func- 
tion contains no seasonality, so the equation does not contain monthly dummy variables (say, 
feb, mar, ..., dec, with January as the base month). Also, by assumption, the supply of fish is sea- 
sonal, which means that the supply function does depend on at least some of the monthly dummy 
variables. Even without solving the reduced form for log(prcfish), we conclude that it depends 
on the monthly dummy variables. Since these are exogenous, they can be used as instruments for 
log(prcfish) in the demand equation. Therefore, we can estimate the demand-for-fish equation using 
monthly dummies as the IVs for log(prcfish). Identification requires that at least one monthly dummy 
variable appears with a nonzero coefficient in the reduced form for log(prcfish). 


Chapter 17 


Question 17.1: $H_0$: $\beta_4 = \beta_5 = \beta_6 = 0$, so that there are three restrictions and therefore three df 
in the LR or Wald test. 


Question 17.2: We need the partial derivative of $\Phi(\hat{\beta}_0 + \hat{\beta}_1 nwifeinc + \hat{\beta}_2 educ + 
\hat{\beta}_3 exper + \hat{\beta}_4 exper^2 + \cdots)$ with respect to exper, which is $\phi(\cdot)(\hat{\beta}_3 + 2\hat{\beta}_4 exper)$, where 
$\phi(\cdot)$ is evaluated at the given values and the initial level of experience. Therefore, we 
need to evaluate the standard normal probability density at $.270 - .012(20.13) + 
.131(12.3) + .123(10) - .0019(10^2) - .053(42.5) - .868(0) + .036(1) \approx .463$, where we plug 
in the initial level of experience (10). But $\phi(.463) = (2\pi)^{-1/2}\exp[-(.463^2)/2] \approx .358$. Next, we 
multiply this by $\hat{\beta}_3 + 2\hat{\beta}_4 exper$, which is evaluated at exper = 10. The partial effect using the 
calculus approximation is $.358[.123 - 2(.0019)(10)] \approx .030$. In other words, at the given values of the 
explanatory variables and starting at exper = 10, the next year of experience increases the probability 
of labor force participation by about .03. 
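
The probit partial effect in Question 17.2 can be recomputed step by step. This sketch uses only the rounded coefficients and covariate values quoted in the answer, so small differences in later decimal places are expected; the variable names are illustrative.

```python
import math

def normal_pdf(z):
    """Standard normal probability density function."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

exper = 10  # initial level of experience

# Probit index at the given covariate values, using the rounded estimates.
index = (0.270 - 0.012 * 20.13 + 0.131 * 12.3 + 0.123 * exper
         - 0.0019 * exper ** 2 - 0.053 * 42.5 - 0.868 * 0 + 0.036 * 1)

density = normal_pdf(index)

# Calculus approximation: density times the derivative of the index in exper.
partial_effect = density * (0.123 - 2 * 0.0019 * exper)

print(round(index, 3), round(density, 3), round(partial_effect, 3))
```

Because exper enters as a quadratic, the derivative of the index depends on the starting level of experience, which is why the answer fixes exper = 10 before evaluating.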
Question 17.3: No. The number of extramarital affairs is a nonnegative integer, which 
presumably takes on zero or small numbers for a substantial fraction of the population. It is not realistic to 
use a Tobit model, which, while allowing a pileup at zero, treats $y$ as being continuously distributed 
over positive values. Formally, assuming that $y = \max(0, y^*)$, where $y^*$ is normally distributed, is at 
odds with the discreteness of the number of extramarital affairs when $y > 0$. 


Question 17.4: The adjusted standard errors are the usual Poisson MLE standard errors 
multiplied by $\hat{\sigma} = \sqrt{2} \approx 1.41$, so the adjusted standard errors will be about 41% higher. The 
quasi-LR statistic is the usual LR statistic divided by $\hat{\sigma}^2$, so it will be one-half of the usual LR statistic. 


Question 17.5: By assumption, $mvp_i = \beta_0 + x_i\beta + u_i$ where, as usual, $x_i\beta$ denotes a linear 
function of the exogenous variables. Now, observed wage is the largest of the minimum wage and the 
marginal value product, so $wage_i = \max(minwage_i, mvp_i)$, which is very similar to equation (17.34), 
except that the max operator has replaced the min operator. 


Chapter 18 


Question 18.1: We can plug these values directly into equation (18.1) and take expectations.
First, because $z_s = 0$ for all $s < 0$, $y_{-1} = \alpha + u_{-1}$. Then, $z_0 = 1$, so $y_0 = \alpha + \delta_0 + u_0$.
For $h \geq 1$, $y_h = \alpha + \delta_{h-1} + \delta_h + u_h$. Because the errors have zero expected values,
$E(y_{-1}) = \alpha$, $E(y_0) = \alpha + \delta_0$, and $E(y_h) = \alpha + \delta_{h-1} + \delta_h$ for all $h \geq 1$. As $h \to \infty$, $\delta_h \to 0$.
It follows that $E(y_h) \to \alpha$ as $h \to \infty$, that is, the expected value of $y_h$ returns to the expected value
before the increase in $z$ at time zero. This makes sense: although the increase in $z$ lasted for two
periods, it is still a temporary increase.


Question 18.2: Under the described setup, $\Delta y_t$ and $\Delta x_t$ are i.i.d. sequences that are independent
of one another. In particular, $\Delta y_t$ and $\Delta x_t$ are uncorrelated. If $\hat{\gamma}_1$ is the slope coefficient from regressing
$\Delta y_t$ on $\Delta x_t$, $t = 1, 2, \ldots, n$, then plim $\hat{\gamma}_1 = 0$. This is as it should be, as we are regressing one I(0)
process on another I(0) process, and they are uncorrelated. We write the equation $\Delta y_t = \gamma_0 + \gamma_1 \Delta x_t + e_t$,
where $\gamma_0 = \gamma_1 = 0$. Because $\{e_t\}$ is independent of $\{\Delta x_t\}$, the strict exogeneity assumption holds.
Moreover, $\{e_t\}$ is serially uncorrelated and homoskedastic. By Theorem 11.2 in Chapter 11, the t statistic
for $\hat{\gamma}_1$ has an approximate standard normal distribution. If $e_t$ is normally distributed, the classical
linear model assumptions hold, and the t statistic has an exact t distribution.


Question 18.3: Write $x_t = x_{t-1} + a_t$, where $\{a_t\}$ is I(0). By assumption, there is a linear combination,
say, $s_t = y_t - \beta x_t$, which is I(0). Now, $y_t - \beta x_{t-1} = y_t - \beta(x_t - a_t) = s_t + \beta a_t$. Because $s_t$
and $a_t$ are I(0) by assumption, so is $s_t + \beta a_t$.


Question 18.4: Just use the sum of squared residuals form of the F test and assume
homoskedasticity. The restricted SSR is obtained by regressing $\Delta hy6_t - \Delta hy3_{t-1} +
(hy6_{t-1} - hy3_{t-2})$ on a constant. Notice that $\alpha_0$ is the only parameter to estimate in
$\Delta hy6_t = \alpha_0 + \gamma_0 \Delta hy3_{t-1} + \delta(hy6_{t-1} - hy3_{t-2})$ when the restrictions are imposed. The unrestricted
sum of squared residuals is obtained from equation (18.39).


Question 18.5: We are fitting two equations: $\hat{y}_t = \hat{\alpha} + \hat{\beta} t$ and $\hat{y}_t = \hat{\gamma} + \hat{\delta} year_t$. We can obtain
the relationship between the parameters by noting that $year_t = t + 49$. Plugging this into the second
equation gives $\hat{y}_t = \hat{\gamma} + \hat{\delta}(t + 49) = (\hat{\gamma} + 49\hat{\delta}) + \hat{\delta} t$. Matching the slope and intercept with
the first equation gives $\hat{\delta} = \hat{\beta}$ (so that the slopes on $t$ and $year_t$ are identical) and $\hat{\alpha} = \hat{\gamma} + 49\hat{\delta}$.
Generally, when we use year rather than $t$, the intercept will change, but the slope will not. (You can verify
this by using one of the time series data sets, such as HSEINV or INVEN.) Whether we use $t$ or some
measure of year does not change fitted values, and, naturally, it does not change forecasts of future values.
The intercept simply adjusts appropriately to different ways of including a trend in the regression.
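
The invariance described in Question 18.5 is easy to confirm numerically. The following sketch uses a small made-up series rather than HSEINV or INVEN; the data values and the helper `ols_line` are assumptions introduced only for illustration.

```python
def ols_line(x, y):
    """Simple-regression OLS: return (intercept, slope) for y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Hypothetical trending series; year_t = t + 49 as in the answer above.
t = list(range(1, 11))
year = [ti + 49 for ti in t]
y = [2.1, 2.9, 3.2, 4.8, 5.1, 5.7, 7.2, 7.9, 8.4, 10.3]

alpha_hat, beta_hat = ols_line(t, y)       # regression on the trend t
gamma_hat, delta_hat = ols_line(year, y)   # regression on year

print("slopes:", beta_hat, delta_hat)
print("intercepts:", alpha_hat, gamma_hat + 49 * delta_hat)
```

Up to floating-point error, the two slopes coincide and the intercept from the trend regression equals the year intercept plus 49 times the slope, so the fitted values are the same under either parameterization.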


TABLE G.1 Cumulative Areas under the Standard Normal Distribution
z 0 1 2 3 4 5 6 7 8 9 
—3.0 0.0013 0.0013 0.0013 0.0012 0.0012 0.0011 0.0011 0.0011 0.0010 0.0010 
—2.9 0.0019 0.0018 0.0018 0.0017 0.0016 0.0016 0.0015 0.0015 0.0014 0.0014 
—2.8 0.0026 0.0025 0.0024 0.0023 0.0023 0.0022 0.0021 0.0021 0.0020 0.0019 
—2.7 0.0035 0.0034 0.0033 0.0032 0.0031 0.0030 0.0029 0.0028 0.0027 0.0026 
—2.6 0.0047 0.0045 0.0044 0.0043 0.0041 0.0040 0.0039 0.0038 0.0037 0.0036 
—2.5 0.0062 0.0060 0.0059 0.0057 0.0055 0.0054 0.0052 0.0051 0.0049 0.0048 
—2.4 0.0082 0.0080 0.0078 0.0075 0.0073 0.0071 0.0069 0.0068 0.0066 0.0064 
—2.3 0.0107 0.0104 0.0102 0.0099 0.0096 0.0094 0.0091 0.0089 0.0087 0.0084 
—2.2 0.0139 0.0136 0.0132 0.0129 0.0125 0.0122 0.0119 0.0116 0.0113 0.0110 
—2.1 0.0179 0.0174 0.0170 0.0166 0.0162 0.0158 0.0154 0.0150 0.0146 0.0143 
—2.0 0.0228 0.0222 0.0217 0.0212 0.0207 0.0202 0.0197 0.0192 0.0188 0.0183 
—1.9 0.0287 0.0281 0.0274 0.0268 0.0262 0.0256 0.0250 0.0244 0.0239 0.0233 
—1.8 0.0359 0.0351 0.0344 0.0336 0.0329 0.0322 0.0314 0.0307 0.0301 0.0294 
—1.7 0.0446 0.0436 0.0427 0.0418 0.0409 0.0401 0.0392 0.0384 0.0375 0.0367 
—1.6 0.0548 0.0537 0.0526 0.0516 0.0505 0.0495 0.0485 0.0475 0.0465 0.0455 
—1.5 0.0668 0.0655 0.0643 0.0630 0.0618 0.0606 0.0594 0.0582 0.0571 0.0559 
—1.4 0.0808 0.0793 0.0778 0.0764 0.0749 0.0735 0.0721 0.0708 0.0694 0.0681 
—1.3 0.0968 0.0951 0.0934 0.0918 0.0901 0.0885 0.0869 0.0853 0.0838 0.0823 
−1.2 0.1151 0.1131 0.1112 0.1093 0.1075 0.1056 0.1038 0.1020 0.1003 0.0985
−1.1 0.1357 0.1335 0.1314 0.1292 0.1271 0.1251 0.1230 0.1210 0.1190 0.1170
—1.0 0.1587 0.1562 0.1539 0.1515 0.1492 0.1469 0.1446 0.1423 0.1401 0.1379 
—0.9 0.1841 0.1814 0.1788 0.1762 0.1736 0.1711 0.1685 0.1660 0.1635 0.1611 
—0.8 0.2119 0.2090 0.2061 0.2033 0.2005 0.1977 0.1949 0.1922 0.1894 0.1867 
—0.7 0.2420 0.2389 0.2358 0.2327 0.2296 0.2266 0.2236 0.2206 0.2177 0.2148 
—0.6 0.2743 0.2709 0.2676 0.2643 0.2611 0.2578 0.2546 0.2514 0.2483 0.2451 
—0.5 0.3085 0.3050 0.3015 0.2981 0.2946 0.2912 0.2877 0.2843 0.2810 0.2776 
—0.4 0.3446 0.3409 0.3372 0.3336 0.3300 0.3264 0.3228 0.3192 0.3156 0.3121 


−0.3 0.3821 0.3783 0.3745 0.3707 0.3669 0.3632 0.3594 0.3557 0.3520 0.3483
—0.2 0.4207 0.4168 0.4129 0.4090 0.4052 0.4013 0.3974 0.3936 0.3897 0.3859 
−0.1 0.4602 0.4562 0.4522 0.4483 0.4443 0.4404 0.4364 0.4325 0.4286 0.4247
—0.0 0.5000 0.4960 0.4920 0.4880 0.4840 0.4801 0.4761 0.4721 0.4681 0.4641 

0.0 0.5000 0.5040 0.5080 0.5120 0.5160 0.5199 0.5239 0.5279 0.5319 0.5359 
0.1 0.5398 0.5438 0.5478 0.5517 0.5557 0.5596 0.5636 0.5675 0.5714 0.5753 
0.2 0.5793 0.5832 0.5871 0.5910 0.5948 0.5987 0.6026 0.6064 0.6103 0.6141 
0.3 0.6179 0.6217 0.6255 0.6293 0.6331 0.6368 0.6406 0.6443 0.6480 0.6517 
0.4 0.6554 0.6591 0.6628 0.6664 0.6700 0.6736 0.6772 0.6808 0.6844 0.6879 
0.5 0.6915 0.6950 0.6985 0.7019 0.7054 0.7088 0.7123 0.7157 0.7190 0.7224 
0.6 0.7257 0.7291 0.7324 0.7357 0.7389 0.7422 0.7454 0.7486 0.7517 0.7549 
0.7 0.7580 0.7611 0.7642 0.7673 0.7704 0.7734 0.7764 0.7794 0.7823 0.7852 
0.8 0.7881 0.7910 0.7939 0.7967 0.7995 0.8023 0.8051 0.8078 0.8106 0.8133 
0.9 0.8159 0.8186 0.8212 0.8238 0.8264 0.8289 0.8315 0.8340 0.8365 0.8389 
1.0 0.8413 0.8438 0.8461 0.8485 0.8508 0.8531 0.8554 0.8577 0.8599 0.8621 
1.1 0.8643 0.8665 0.8686 0.8708 0.8729 0.8749 0.8770 0.8790 0.8810 0.8830 
1.2 0.8849 0.8869 0.8888 0.8907 0.8925 0.8944 0.8962 0.8980 0.8997 0.9015
1.3 0.9032 0.9049 0.9066 0.9082 0.9099 0.9115 0.9131 0.9147 0.9162 0.9177 
1.4 0.9192 0.9207 0.9222 0.9236 0.9251 0.9265 0.9279 0.9292 0.9306 0.9319 
1.5 0.9332 0.9345 0.9357 0.9370 0.9382 0.9394 0.9406 0.9418 0.9429 0.9441 
1.6 0.9452 0.9463 0.9474 0.9484 0.9495 0.9505 0.9515 0.9525 0.9535 0.9545 
1.7 0.9554 0.9564 0.9573 0.9582 0.9591 0.9599 0.9608 0.9616 0.9625 0.9633 
1.8 0.9641 0.9649 0.9656 0.9664 0.9671 0.9678 0.9686 0.9693 0.9699 0.9706 
1.9 0.9713 0.9719 0.9726 0.9732 0.9738 0.9744 0.9750 0.9756 0.9761 0.9767 
2.0 0.9772 0.9778 0.9783 0.9788 0.9793 0.9798 0.9803 0.9808 0.9812 0.9817 
2.1 0.9821 0.9826 0.9830 0.9834 0.9838 0.9842 0.9846 0.9850 0.9854 0.9857 
2.2 0.9861 0.9864 0.9868 0.9871 0.9875 0.9878 0.9881 0.9884 0.9887 0.9890 
2.3 0.9893 0.9896 0.9898 0.9901 0.9904 0.9906 0.9909 0.9911 0.9913 0.9916 
2.4 0.9918 0.9920 0.9922 0.9925 0.9927 0.9929 0.9931 0.9932 0.9934 0.9936 
2.5 0.9938 0.9940 0.9941 0.9943 0.9945 0.9946 0.9948 0.9949 0.9951 0.9952 
2.6 0.9953 0.9955 0.9956 0.9957 0.9959 0.9960 0.9961 0.9962 0.9963 0.9964 
2.7 0.9965 0.9966 0.9967 0.9968 0.9969 0.9970 0.9971 0.9972 0.9973 0.9974 
2.8 0.9974 0.9975 0.9976 0.9977 0.9977 0.9978 0.9979 0.9979 0.9980 0.9981 
2.9 0.9981 0.9982 0.9982 0.9983 0.9984 0.9984 0.9985 0.9985 0.9986 0.9986 
3.0 0.9987 0.9987 0.9987 0.9988 0.9988 0.9989 0.9989 0.9989 0.9990 0.9990 


Examples: If Z ~ Normal(0, 1), then P(Z < —1.32) = .0934 and P(Z < 1.84) = .9671. 
Source: This table was generated using the Stata® function normal. 
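
Entries of Table G.1 can also be reproduced without Stata. The sketch below is one possible approach using the error function in Python's standard library; the function name `normal_cdf` is an assumption made for this example.

```python
import math


def normal_cdf(z):
    """Standard normal cdf via the identity Phi(z) = 0.5 * (1 + erf(z / sqrt(2)))."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


# The two examples given below the table.
print(round(normal_cdf(-1.32), 4))
print(round(normal_cdf(1.84), 4))

# Rebuild one row of the table: z = 1.2 with second decimal digits 0 through 9.
row = [round(normal_cdf(1.2 + d / 100.0), 4) for d in range(10)]
print(row)
```

Rounded to four decimals, the two printed probabilities should match the examples stated under the table.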


TABLE G.2 Critical Values of the t Distribution


Significance Level
1-Tailed: .10 .05 .025 .01 .005
2-Tailed: .20 .10 .05 .02 .01

Degrees of Freedom
1 3.078 6.314 12.706 31.821 63.657
2 1.886 2.920 4.303 6.965 9.925
3 1.638 2.353 3.182 4.541 5.841
4 1.533 2.132 2.776 3.747 4.604
5 1.476 2.015 2.571 3.365 4.032
6 1.440 1.943 2.447 3.143 3.707
7 1.415 1.895 2.365 2.998 3.499
8 1.397 1.860 2.306 2.896 3.355
9 1.383 1.833 2.262 2.821 3.250
10 1.372 1.812 2.228 2.764 3.169
11 1.363 1.796 2.201 2.718 3.106
12 1.356 1.782 2.179 2.681 3.055
13 1.350 1.771 2.160 2.650 3.012
14 1.345 1.761 2.145 2.624 2.977
15 1.341 1.753 2.131 2.602 2.947
16 1.337 1.746 2.120 2.583 2.921
17 1.333 1.740 2.110 2.567 2.898
18 1.330 1.734 2.101 2.552 2.878
19 1.328 1.729 2.093 2.539 2.861
20 1.325 1.725 2.086 2.528 2.845
21 1.323 1.721 2.080 2.518 2.831
22 1.321 1.717 2.074 2.508 2.819
23 1.319 1.714 2.069 2.500 2.807
24 1.318 1.711 2.064 2.492 2.797
25 1.316 1.708 2.060 2.485 2.787
26 1.315 1.706 2.056 2.479 2.779
27 1.314 1.703 2.052 2.473 2.771
28 1.313 1.701 2.048 2.467 2.763
29 1.311 1.699 2.045 2.462 2.756
30 1.310 1.697 2.042 2.457 2.750
40 1.303 1.684 2.021 2.423 2.704
60 1.296 1.671 2.000 2.390 2.660
90 1.291 1.662 1.987 2.368 2.632
120 1.289 1.658 1.980 2.358 2.617
∞ 1.282 1.645 1.960 2.326 2.576


Examples: The 1% critical value for a one-tailed test with 25 df is 2.485. The 5% critical value for a two-tailed test with large
(> 120) df is 1.96.


Source: This table was generated using the Stata® function invttail. 


TABLE G.3a 10% Critical Values of the F Distribution


Numerator Degrees of Freedom
1 2 3 4 5 6 7 8 9 10

Denominator Degrees of Freedom
10 3.29 2.92 2.73 2.61 2.52 2.46 2.41 2.38 2.35 2.32
11 3.23 2.86 2.66 2.54 2.45 2.39 2.34 2.30 2.27 2.25
12 3.18 2.81 2.61 2.48 2.39 2.33 2.28 2.24 2.21 2.19
13 3.14 2.76 2.56 2.43 2.35 2.28 2.23 2.20 2.16 2.14
14 3.10 2.73 2.52 2.39 2.31 2.24 2.19 2.15 2.12 2.10
15 3.07 2.70 2.49 2.36 2.27 2.21 2.16 2.12 2.09 2.06
16 3.05 2.67 2.46 2.33 2.24 2.18 2.13 2.09 2.06 2.03
17 3.03 2.64 2.44 2.31 2.22 2.15 2.10 2.06 2.03 2.00
18 3.01 2.62 2.42 2.29 2.20 2.13 2.08 2.04 2.00 1.98
19 2.99 2.61 2.40 2.27 2.18 2.11 2.06 2.02 1.98 1.96
20 2.97 2.59 2.38 2.25 2.16 2.09 2.04 2.00 1.96 1.94
21 2.96 2.57 2.36 2.23 2.14 2.08 2.02 1.98 1.95 1.92
22 2.95 2.56 2.35 2.22 2.13 2.06 2.01 1.97 1.93 1.90
23 2.94 2.55 2.34 2.21 2.11 2.05 1.99 1.95 1.92 1.89
24 2.93 2.54 2.33 2.19 2.10 2.04 1.98 1.94 1.91 1.88
25 2.92 2.53 2.32 2.18 2.09 2.02 1.97 1.93 1.89 1.87
26 2.91 2.52 2.31 2.17 2.08 2.01 1.96 1.92 1.88 1.86
27 2.90 2.51 2.30 2.17 2.07 2.00 1.95 1.91 1.87 1.85
28 2.89 2.50 2.29 2.16 2.06 2.00 1.94 1.90 1.87 1.84
29 2.89 2.50 2.28 2.15 2.06 1.99 1.93 1.89 1.86 1.83
30 2.88 2.49 2.28 2.14 2.05 1.98 1.93 1.88 1.85 1.82
40 2.84 2.44 2.23 2.09 2.00 1.93 1.87 1.83 1.79 1.76
60 2.79 2.39 2.18 2.04 1.95 1.87 1.82 1.77 1.74 1.71
90 2.76 2.36 2.15 2.01 1.91 1.84 1.78 1.74 1.70 1.67
120 2.75 2.35 2.13 1.99 1.90 1.82 1.77 1.72 1.68 1.65
∞ 2.71 2.30 2.08 1.94 1.85 1.77 1.72 1.67 1.63 1.60


Example: The 10% critical value for numerator df = 2 and denominator df = 40 is 2.44. 
Source: This table was generated using the Stata® function invFtail. 


TABLE G.3b 5% Critical Values of the F Distribution


Numerator Degrees of Freedom
1 2 3 4 5 6 7 8 9 10

Denominator Degrees of Freedom
10 4.96 4.10 3.71 3.48 3.33 3.22 3.14 3.07 3.02 2.98
11 4.84 3.98 3.59 3.36 3.20 3.09 3.01 2.95 2.90 2.85
12 4.75 3.89 3.49 3.26 3.11 3.00 2.91 2.85 2.80 2.75
13 4.67 3.81 3.41 3.18 3.03 2.92 2.83 2.77 2.71 2.67
14 4.60 3.74 3.34 3.11 2.96 2.85 2.76 2.70 2.65 2.60
15 4.54 3.68 3.29 3.06 2.90 2.79 2.71 2.64 2.59 2.54
16 4.49 3.63 3.24 3.01 2.85 2.74 2.66 2.59 2.54 2.49
17 4.45 3.59 3.20 2.96 2.81 2.70 2.61 2.55 2.49 2.45
18 4.41 3.55 3.16 2.93 2.77 2.66 2.58 2.51 2.46 2.41
19 4.38 3.52 3.13 2.90 2.74 2.63 2.54 2.48 2.42 2.38
20 4.35 3.49 3.10 2.87 2.71 2.60 2.51 2.45 2.39 2.35
21 4.32 3.47 3.07 2.84 2.68 2.57 2.49 2.42 2.37 2.32
22 4.30 3.44 3.05 2.82 2.66 2.55 2.46 2.40 2.34 2.30
23 4.28 3.42 3.03 2.80 2.64 2.53 2.44 2.37 2.32 2.27
24 4.26 3.40 3.01 2.78 2.62 2.51 2.42 2.36 2.30 2.25
25 4.24 3.39 2.99 2.76 2.60 2.49 2.40 2.34 2.28 2.24
26 4.23 3.37 2.98 2.74 2.59 2.47 2.39 2.32 2.27 2.22
27 4.21 3.35 2.96 2.73 2.57 2.46 2.37 2.31 2.25 2.20
28 4.20 3.34 2.95 2.71 2.56 2.45 2.36 2.29 2.24 2.19
29 4.18 3.33 2.93 2.70 2.55 2.43 2.35 2.28 2.22 2.18
30 4.17 3.32 2.92 2.69 2.53 2.42 2.33 2.27 2.21 2.16
40 4.08 3.23 2.84 2.61 2.45 2.34 2.25 2.18 2.12 2.08
60 4.00 3.15 2.76 2.53 2.37 2.25 2.17 2.10 2.04 1.99
90 3.95 3.10 2.71 2.47 2.32 2.20 2.11 2.04 1.99 1.94
120 3.92 3.07 2.68 2.45 2.29 2.17 2.09 2.02 1.96 1.91
∞ 3.84 3.00 2.60 2.37 2.21 2.10 2.01 1.94 1.88 1.83


Example: The 5% critical value for numerator df = 4 and large denominator df (∞) is 2.37.
Source: This table was generated using the Stata® function invFtail. 


TABLE G.3c 1% Critical Values of the F Distribution


Numerator Degrees of Freedom
1 2 3 4 5 6 7 8 9 10

Denominator Degrees of Freedom
10 10.04 7.56 6.55 5.99 5.64 5.39 5.20 5.06 4.94 4.85
11 9.65 7.21 6.22 5.67 5.32 5.07 4.89 4.74 4.63 4.54
12 9.33 6.93 5.95 5.41 5.06 4.82 4.64 4.50 4.39 4.30
13 9.07 6.70 5.74 5.21 4.86 4.62 4.44 4.30 4.19 4.10
14 8.86 6.51 5.56 5.04 4.69 4.46 4.28 4.14 4.03 3.94
15 8.68 6.36 5.42 4.89 4.56 4.32 4.14 4.00 3.89 3.80
16 8.53 6.23 5.29 4.77 4.44 4.20 4.03 3.89 3.78 3.69
17 8.40 6.11 5.18 4.67 4.34 4.10 3.93 3.79 3.68 3.59
18 8.29 6.01 5.09 4.58 4.25 4.01 3.84 3.71 3.60 3.51
19 8.18 5.93 5.01 4.50 4.17 3.94 3.77 3.63 3.52 3.43
20 8.10 5.85 4.94 4.43 4.10 3.87 3.70 3.56 3.46 3.37
21 8.02 5.78 4.87 4.37 4.04 3.81 3.64 3.51 3.40 3.31
22 7.95 5.72 4.82 4.31 3.99 3.76 3.59 3.45 3.35 3.26
23 7.88 5.66 4.76 4.26 3.94 3.71 3.54 3.41 3.30 3.21
24 7.82 5.61 4.72 4.22 3.90 3.67 3.50 3.36 3.26 3.17
25 7.77 5.57 4.68 4.18 3.85 3.63 3.46 3.32 3.22 3.13
26 7.72 5.53 4.64 4.14 3.82 3.59 3.42 3.29 3.18 3.09
27 7.68 5.49 4.60 4.11 3.78 3.56 3.39 3.26 3.15 3.06
28 7.64 5.45 4.57 4.07 3.75 3.53 3.36 3.23 3.12 3.03
29 7.60 5.42 4.54 4.04 3.73 3.50 3.33 3.20 3.09 3.00
30 7.56 5.39 4.51 4.02 3.70 3.47 3.30 3.17 3.07 2.98
40 7.31 5.18 4.31 3.83 3.51 3.29 3.12 2.99 2.89 2.80
60 7.08 4.98 4.13 3.65 3.34 3.12 2.95 2.82 2.72 2.63
90 6.93 4.85 4.01 3.54 3.23 3.01 2.84 2.72 2.61 2.52
120 6.85 4.79 3.95 3.48 3.17 2.96 2.79 2.66 2.56 2.47
∞ 6.63 4.61 3.78 3.32 3.02 2.80 2.64 2.51 2.41 2.32


Example: The 1% critical value for numerator df = 3 and denominator df = 60 is 4.13.
Source: This table was generated using the Stata® function invFtail.


TABLE G.4 Critical Values of the Chi-Square Distribution


Significance Level
.10 .05 .01

Degrees of Freedom
1 2.71 3.84 6.63
2 4.61 5.99 9.21
3 6.25 7.81 11.34
4 7.78 9.49 13.28
5 9.24 11.07 15.09
6 10.64 12.59 16.81
7 12.02 14.07 18.48
8 13.36 15.51 20.09
9 14.68 16.92 21.67
10 15.99 18.31 23.21
11 17.28 19.68 24.72
12 18.55 21.03 26.22
13 19.81 22.36 27.69
14 21.06 23.68 29.14
15 22.31 25.00 30.58
16 23.54 26.30 32.00
17 24.77 27.59 33.41
18 25.99 28.87 34.81
19 27.20 30.14 36.19
20 28.41 31.41 37.57
21 29.62 32.67 38.93
22 30.81 33.92 40.29
23 32.01 35.17 41.64
24 33.20 36.42 42.98
25 34.38 37.65 44.31
26 35.56 38.89 45.64
27 36.74 40.11 46.96
28 37.92 41.34 48.28
29 39.09 42.56 49.59
30 40.26 43.77 50.89


Example: The 5% critical value with df = 8 is 15.51. 
Source: This table was generated using the Stata® function invchi2tail. 


Adjusted R-Squared: A goodness-of-fit measure in multiple
regression analysis that penalizes additional explanatory 
variables by using a degrees of freedom adjustment in esti- 
mating the error variance. 

Alternative Hypothesis: The hypothesis against which the 
null hypothesis is tested. 

AR(1) Serial Correlation: The errors in a time series regres- 
sion model follow an AR(1) model. 

Asymptotic Bias: See inconsistency. 

Asymptotic Confidence Interval: A confidence interval that 
is approximately valid in large sample sizes. 

Asymptotic Normality: The sampling distribution of a prop- 
erly normalized estimator converges to the standard normal 
distribution. 

Asymptotic Properties: Properties of estimators and test 
statistics that apply when the sample size grows without 
bound. 

Asymptotic Standard Error: A standard error that is valid 
in large samples. 

Asymptotic t Statistic: A t statistic that has an approximate
standard normal distribution in large samples.

Asymptotic Variance: The square of the value by which we 
must divide an estimator in order to obtain an asymptotic 
standard normal distribution. 

Asymptotically Efficient: For consistent estimators with 
asymptotically normal distributions, the estimator with the 
smallest asymptotic variance. 

Asymptotically Uncorrelated: A time series process in 
which the correlation between random variables at two 
points in time tends to zero as the time interval between 
them increases. (See also weakly dependent.) 

Attenuation Bias: Bias in an estimator that is always toward 
zero; thus, the expected value of an estimator with attenua- 
tion bias is less in magnitude than the absolute value of the 
parameter. 

Augmented Dickey-Fuller Test: A test for a unit root that 
includes lagged changes of the variable as regressors. 


Autocorrelation: See serial correlation. 

Autoregressive Conditional Heteroskedasticity (ARCH): A 
model of dynamic heteroskedasticity where the variance of 
the error term, given past information, depends linearly on 
the past squared errors. 

Autoregressive Process of Order One [AR(1)]: A time 
series model whose current value depends linearly on its 
most recent value plus an unpredictable disturbance. 

Auxiliary Regression: A regression used to compute a test 
statistic—such as the test statistics for heteroskedasticity 
and serial correlation—or any other regression that does 
not estimate the model of primary interest. 

Average Marginal Effect: See average partial effect. 

Average Partial Effect (APE): For nonconstant partial effects, 
the partial effect averaged across the specified population. 

Average Causal Effect (ACE): See average treatment effect. 

Average Treatment Effect (ATE): A treatment, or policy, 
effect averaged across the population. 


Balanced Panel: A panel data set where all years (or periods) 
of data are available for all cross-sectional units. 

Base Group: The group represented by the overall inter- 
cept in a multiple regression model that includes dummy 
explanatory variables. 

Base Period: For index numbers, such as price or production 
indices, the period against which all other time periods are 
measured. 

Base Value: The value assigned to the base period for 
constructing an index number; usually the base value is 1 
or 100. 

Benchmark Group: See base group. 

Best Linear Unbiased Estimator (BLUE): Among all lin- 
ear unbiased estimators, the one with the smallest vari- 
ance. OLS is BLUE, conditional on the sample values 
of the explanatory variables, under the Gauss-Markov 
assumptions. 

Beta Coefficients: See standardized coefficients. 

Bias: The difference between the expected value of an
estimator and the population value that the estimator is
supposed to be estimating.

Biased Estimator: An estimator whose expectation, or sam- 
pling mean, is different from the population value it is sup- 
posed to be estimating. 

Biased Toward Zero: A description of an estimator whose 
expectation in absolute value is less than the absolute value 
of the population parameter. 

Binary (Dummy) Variable: See dummy variable. 

Binary Response Model: A model for a binary (dummy) 
dependent variable. 

Binary Variable: See dummy variable. 

Binomial Distribution: The probability distribution of the 
number of successes out of n independent Bernoulli trials, 
where each trial has the same probability of success. 

BLUE: See best linear unbiased estimator. 

Bootstrap: A resampling method that draws random samples, 
with replacement, from the original data set. 

Bootstrap Standard Error: A standard error obtained as the 
sample standard deviation of an estimate across all boot- 
strap samples. 

Breusch-Godfrey Test: An asymptotically justified test for 
AR(p) serial correlation, with AR(1) being the most popu- 
lar; the test allows for lagged dependent variables as well 
as other regressors that are not strictly exogenous. 

Breusch-Pagan Test for Heteroskedasticity (BP Test):
Refer to Breusch-Pagan Test.


C


Causal Effect: A ceteris paribus change in one variable that 
has an effect on another variable. 

Causal (Treatment) Effect: The difference in outcomes 
between when an observation has a treatment (e.g. policy) 
and when it is not treated. 

Censored Normal Regression Model: The special case of the 
censored regression model where the underlying population 
model satisfies the classical linear model assumptions. 

Censored Regression Model: A multiple regression model 
where the dependent variable has been censored above or 
below some known threshold. 

Central Limit Theorem (CLT): A key result from prob- 
ability theory which implies that the sum of independent 
random variables, or even weakly dependent random vari- 
ables, when standardized by its standard deviation, has a 
distribution that tends to standard normal as the sample 
size grows. 

Ceteris Paribus: All other relevant factors are held fixed. 

Chi-Square Distribution: A probability distribution obtained 
by adding the squares of independent standard normal ran- 
dom variables. The number of terms in the sum equals the 
degrees of freedom in the distribution. 


Chi-Square Random Variable: A random variable with a 
chi-square distribution. 

Chow Statistic: An F statistic for testing the equality of 
regression parameters across different groups (say, men 
and women) or time periods (say, before and after a policy 
change). 

Classical Errors-in-Variables (CEV): A measurement error
model where the observed measure equals the actual vari- 
able plus an independent, or at least an uncorrelated, mea- 
surement error. 

Classical Linear Model: The multiple linear regression model
under the full set of classical linear model assumptions.

Classical Linear Model (CLM) Assumptions: The ideal set
of assumptions for multiple regression analysis: for
cross-sectional analysis, Assumptions MLR.1 through MLR.6,
and for time series analysis, Assumptions TS.1 through
TS.6. The assumptions include linearity in the parameters,
no perfect collinearity, the zero conditional mean
assumption, homoskedasticity, no serial correlation, and normality
of the errors.

Cluster Effect: An unobserved effect that is common to all 
units, usually people, in the cluster. 

Cluster-Robust Standard Errors: Standard error estimates 
that allow for unrestricted forms of serial correlation and 
heteroskedasticity in panel data. These standard errors 
require a large cross section (N) and not too large time 
series (T). 

Cluster Sample: A sample of natural clusters or groups that 
usually consist of people. 

Clustering: The act of computing standard errors and test 
statistics that are robust to cluster correlation, either due to 
cluster sampling or to time series correlation in panel data. 

Cochrane-Orcutt (CO) Estimation: A method of estimat- 
ing a multiple linear regression model with AR(1) errors 
and strictly exogenous explanatory variables; unlike Prais- 
Winsten, Cochrane-Orcutt does not use the equation for the 
first time period. 

Coefficient of Determination: See R-squared. 

Cointegration: The notion that a linear combination of two 
series, each of which is integrated of order one, is inte- 
grated of order zero. 

Column Vector: A vector of numbers arranged as a column. 

Complete Cases Indicator: A dummy variable that is equal 
to 1 if and only if we have data for all variables for a par- 
ticular observation and 0 otherwise. 

Composite Error: Refer to Composite Error Term. 

Composite Error Term: In a panel data model, the sum of the 
time-constant unobserved effect and the idiosyncratic error. 

Conditional Distribution: The probability distribution of 
one random variable, given the values of one or more other 
random variables. 

Conditional Expectation: The expected or average value of
one random variable, called the dependent or explained
variable, that depends on the values of one or more other
variables, called the independent or explanatory variables.

Conditional Forecast: A forecast that assumes the future val- 
ues of some explanatory variables are known with certainty. 

Conditional Independence: When treatment and outcome 
variables can be considered to be independent of one 
another after conditioning on control variables. 

Conditional Median: The median of a response variable con- 
ditional on some explanatory variables. 

Conditional Variance: The variance of one random variable, 
given one or more other random variables. 

Confidence Interval (CI): A rule used to construct a random 
interval so that a certain percentage of all data sets, deter- 
mined by the confidence level, yields an interval that con- 
tains the population value. 

Consistency: An estimator converges in probability to the 
correct population value as the sample size grows. 

Consistent Estimator: An estimator that converges in proba- 
bility to the population parameter as the sample size grows 
without bound. 

Consistent Test: A test where, under the alternative hypoth- 
esis, the probability of rejecting the null hypothesis con- 
verges to one as the sample size grows without bound. 

Constant Elasticity Model: A model where the elasticity 
of the dependent variable, with respect to an explanatory 
variable, is constant; in multiple regression, both variables 
appear in logarithmic form. 

Contemporaneously Homoskedastic: Describes a time
series or panel data application in which the variance of
the error term, conditional on the regressors in the same
time period, is constant.

Contemporaneously Exogenous: Describes a regressor in a
time series or panel data application that is uncorrelated
with the error term in the same time period, although it
may be correlated with the errors in other time periods.

Continuous Random Variable: A random variable that takes 
on any particular value with probability zero. 

Control Group: In program evaluation, the group that does 
not participate in the program. 

Control Variable: See explanatory variable. 

Corner Solution Response: A nonnegative dependent vari- 
able that is roughly continuous over strictly positive values 
but takes on the value zero with some regularity. 

Correlated Random Effects: An approach to panel data 
analysis where the correlation between the unobserved 
effect and the explanatory variables is modeled, usually as 
a linear relationship. 

Correlation Coefficient: A measure of linear dependence 
between two random variables that does not depend on 
units of measurement and is bounded between —1 and 1. 

Count Variable: A variable that takes on nonnegative integer 
values. 

Counterfactual Outcomes: The different outcomes that
result from a counterfactual reasoning process.

Counterfactual Reasoning: A method of policy evaluation
in which we imagine an identical observation (individual,
firm, country, etc.) under two different states of the world
(e.g., with a policy and without a policy).

Covariance: A measure of linear dependence between two 
random variables. 

Covariance Stationary: A time series process with constant 
mean and variance where the covariance between any two 
random variables in the sequence depends only on the dis- 
tance between them. 

Covariate: See explanatory variable. 

Critical Value: In hypothesis testing, the value against which 
a test statistic is compared to determine whether or not the 
null hypothesis is rejected. 

Cross-Sectional Data Set: A data set collected by sampling a 
population at a given point in time. 

Cumulative Distribution Function (cdf): A function that 
gives the probability of a random variable being less than 
or equal to any specified real number. 

Cumulative Effect: At any point in time, the change in a 
response variable after a permanent increase in an explanatory 
variable—usually in the context of distributed lag models. 


Data Frequency: The interval at which time series data are 
collected. Yearly, quarterly, and monthly are the most com- 
mon data frequencies. 

Data Mining: The practice of using the same data set to 
estimate numerous models in a search to find the “best” 
model. 

Davidson-MacKinnon Test: A test that is used for testing
a model against a nonnested alternative; it can be
implemented as a t test on the fitted values from the competing
model.

Degrees of Freedom (df): In multiple regression analysis, 
the number of observations minus the number of estimated 
parameters. 

Denominator Degrees of Freedom: In an F test, the degrees 
of freedom in the unrestricted model. 

Dependent Variable: The variable to be explained in a mul- 
tiple regression model (and a variety of other models). 

Derivative: The slope of a smooth function, as defined using 
calculus. 

Descriptive Statistic: A statistic used to summarize a set of 
numbers; the sample average, sample median, and sample 
standard deviation are the most common. 

Deseasonalizing: The removing of the seasonal components 
from a monthly or quarterly time series. 

Detrending: The practice of removing the trend from a time 
series. 

Diagonal Matrix: A matrix with zeros for all off-diagonal
entries.

Dickey-Fuller Distribution: The limiting distribution of the t 
statistic in testing the null hypothesis of a unit root. 

Dickey-Fuller (DF) Test: A t test of the unit root null hypothesis 
in an AR(1) model. (See also augmented Dickey-Fuller test.) 

Difference in Slopes: A description of a model where some 
slope parameters may differ by group or time period. 

Difference-in-Differences (DD or DID) Estimator: An esti- 
mator that arises in policy analysis with data for two time 
periods. One version of the estimator applies to indepen- 
dently pooled cross sections and another to panel data sets. 

Difference-in-Difference-in-Differences (DDD) Estimator:
An estimator that uses one more control group than the
standard difference-in-differences estimator.
Useful in dealing with violations of the parallel trends
assumption.

Difference-Stationary Process: A time series sequence that 
is I(0) in its first differences. 

Discrete Random Variable: A random variable that takes on 
at most a finite or countably infinite number of values. 

Disturbance: See error term. 

Downward Bias: The expected value of an estimator is below 
the population value of the parameter. 

Dummy Dependent Variable: See binary response model. 

Dummy Variable: A variable that takes on the value zero or one. 

Dummy Variable Regression: In a panel data setting, the 
regression that includes a dummy variable for each cross- 
sectional unit, along with the remaining explanatory vari- 
ables. It produces the fixed effects estimator. 

Dummy Variable Trap: The mistake of including too many 
dummy variables among the independent variables; it 
occurs when an overall intercept is in the model and a 
dummy variable is included for each group. 
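
The trap can be seen directly in a small design matrix. The sketch below uses hypothetical group labels and checks whether the dummy columns add up to the intercept column, which is the exact linear dependence behind the problem.

```python
# Hypothetical group labels for six observations.
groups = ["a", "b", "c", "a", "b", "c"]
levels = sorted(set(groups))

# One dummy column per group, in addition to an overall intercept.
dummies = [[1 if g == level else 0 for level in levels] for g in groups]
intercept = [1] * len(groups)

# Each row of dummies sums to one, so the intercept column equals the sum of
# the dummy columns: perfect collinearity.
trapped = [sum(row) for row in dummies] == intercept

# Dropping the dummy for one (base) group breaks this particular dependence:
# the remaining dummies no longer sum to the intercept column.
dummies_ok = [row[1:] for row in dummies]
still_sums_to_intercept = [sum(row) for row in dummies_ok] == intercept

print(trapped, still_sums_to_intercept)
```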

Duration Analysis: An application of the censored regres- 
sion model where the dependent variable is time elapsed 
until a certain event occurs, such as the time before an 
unemployed person becomes reemployed. 

Durbin-Watson (DW) Statistic: A statistic used to test 
for first order serial correlation in the errors of a time 
series regression model under the classical linear model 
assumptions. 
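
A minimal sketch of the computation, assuming the usual formula (the sum of squared changes in the OLS residuals divided by the sum of squared residuals). The function name and the residuals are hypothetical.

```python
def durbin_watson(residuals):
    """DW = sum_{t=2}^{n} (e_t - e_{t-1})^2 / sum_{t=1}^{n} e_t^2."""
    num = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    den = sum(e ** 2 for e in residuals)
    return num / den

# Hypothetical residuals that alternate in sign (negative serial correlation).
dw = durbin_watson([1.0, -1.0, 1.0, -1.0])
print(dw)
```

Values near 2 are consistent with no first order serial correlation; values well below 2 point to positive correlation and values well above 2 to negative correlation.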

Dynamically Complete Model: A time series model where 
no further lags of either the dependent variable or the 
explanatory variables help to explain the mean of the 
dependent variable. 


E 


Econometric Model: An equation relating the dependent 
variable to a set of explanatory variables and unobserved 
disturbances, where unknown population parameters deter- 
mine the ceteris paribus effect of each explanatory variable. 


Economic Model: A relationship derived from economic the- 
ory or less formal economic reasoning. 

Economic Significance: See practical significance. 

Elasticity: The percentage change in one variable given a 1% 
ceteris paribus increase in another variable. 

Empirical Analysis: A study that uses data in a formal 
econometric analysis to test a theory, estimate a relation- 
ship, or determine the effectiveness of a policy. 

Endogenous Explanatory Variable: An explanatory vari- 
able in a multiple regression model that is correlated with 
the error term, either because of an omitted variable, mea- 
surement error, or simultaneity. 

Endogenous Sample Selection: Nonrandom sample selec- 
tion where the selection is related to the dependent variable, 
either directly or through the error term in the equation. 

Endogenous Variables: In simultaneous equations mod- 
els, variables that are determined by the equations in the 
system. 

Engle-Granger Test: A test of the null hypothesis that two 
time series are not cointegrated; the statistic is obtained as 
the Dickey-Fuller statistic using OLS residuals. 

Engle-Granger Two-Step Procedure: A two-step method 
for estimating error correction models whereby the coin- 
tegrating parameter is estimated in the first stage, and the 
error correction parameters are estimated in the second. 

Error Correction Model: A time series model in first dif- 
ferences that also contains an error correction term, 
which works to bring two I(1) series back into long-run 
equilibrium. 

Error Term (Disturbance): The variable in a simple or mul- 
tiple regression equation that contains unobserved factors 
which affect the dependent variable. The error term may 
also include measurement errors in the observed dependent 
or independent variables. 

Error Variance: The variance of the error term in a multiple 
regression model. 

Errors-in-Variables: A situation where either the dependent 
variable or some independent variables are measured with 
error. 

Estimate: The numerical value taken on by an estimator for a 
particular sample of data. 

Estimator: A rule for combining data to produce a numerical 
value for a population parameter; the form of the rule does 
not depend on the particular sample obtained. 

Event Study: An econometric analysis of the effects of an 
event, such as a change in government regulation or eco- 
nomic policy, on an outcome variable. 

Excluding a Relevant Variable: In multiple regression anal- 
ysis, leaving out a variable that has a nonzero partial effect 
on the dependent variable. 

Exclusion Restrictions: Restrictions which state that certain 
variables are excluded from the model (or have zero popu- 
lation coefficients). 


Exogenous Explanatory Variable: An explanatory variable 
that is uncorrelated with the error term. 

Exogenous Sample Selection: A sample selection that either 
depends on exogenous explanatory variables or is indepen- 
dent of the error term in the equation of interest. 

Exogenous Variable: Any variable that is uncorrelated with 
the error term in the model of interest. 

Expected Value: A measure of central tendency in the distri- 
bution of a random variable, including an estimator. 

Experiment: In probability, a general term used to denote an 
event whose outcome is uncertain. In econometric analysis, 
it denotes a situation where data are collected by randomly 
assigning individuals to control and treatment groups. 

Experimental Data: Data that have been obtained by running 
a controlled experiment. 

Experimental Group: See treatment group. 

Explained Sum of Squares (SSE): The total sample varia- 
tion of the fitted values in a multiple regression model. 

Explained Variable: See dependent variable. 

Explanatory Variable: In regression analysis, a variable that 
is used to explain variation in the dependent variable. 

Exponential Function: A mathematical function defined for 
all values that has an increasing slope but a constant pro- 
portionate change. 

Exponential Smoothing: A simple method of forecasting a 
variable that involves a weighting of all previous outcomes 
on that variable. 

Exponential Trend: A trend with a constant growth rate. 


F 


F Distribution: The probability distribution obtained by form- 
ing the ratio of two independent chi-square random variables, 
where each has been divided by its degrees of freedom. 

F Random Variable: A random variable with an F 
distribution. 

F Statistic: A statistic used to test multiple hypotheses about 
the parameters in a multiple regression model. 

Falsification Test: A method of testing the strict exogeneity 
assumption that includes a future value of a policy vari- 
able as a determinant of the current value of the outcome 
variable. 

Feasible GLS (FGLS) Estimator: A GLS procedure where 
variance or correlation parameters are unknown and there- 
fore must first be estimated. (See also generalized least 
squares estimator.) 

Finite Distributed Lag (FDL) Model: A dynamic model 
where one or more explanatory variables are allowed to 
have lagged effects on the dependent variable. 

First Difference: A transformation on a time series con- 
structed by taking the difference of adjacent time periods, 
where the earlier time period is subtracted from the later 
time period. 

First-Differenced Equation: In time series or panel data 
models, an equation where the dependent and independent 
variables have all been first differenced. 

First-Differenced Estimator: In a panel data setting, the 
pooled OLS estimator applied to first differences of the 
data across time. 

First Order Autocorrelation: For a time series process 
ordered chronologically, the correlation coefficient 
between pairs of adjacent observations. 

First Order Conditions: The set of linear equations used to 
solve for the OLS estimates. 

First Stage: The first stage of a 2SLS procedure in which the 
endogenous explanatory variable is regressed on all instru- 
ments and exogenous explanatory variables. 

Fitted Values: The estimated values of the dependent vari- 
able when the values of the independent variables for each 
observation are plugged into the OLS regression line. 

Fixed Effect: See unobserved effect. 

Fixed Effects Estimator: For the unobserved effects panel 
data model, the estimator obtained by applying pooled 
OLS to a time-demeaned equation. 

Fixed Effects Model: An unobserved effects panel data 
model where the unobserved effects are allowed to be arbi- 
trarily correlated with the explanatory variables in each 
time period. 

Fixed Effects Transformation: For panel data, the time- 
demeaned data. 
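
Time-demeaning can be sketched in a few lines: each observation has its own unit's time average subtracted. The function name `time_demean` and the small panel are hypothetical.

```python
from collections import defaultdict

def time_demean(units, values):
    """Subtract each unit's time average from that unit's observations."""
    by_unit = defaultdict(list)
    for unit, value in zip(units, values):
        by_unit[unit].append(value)
    unit_mean = {unit: sum(v) / len(v) for unit, v in by_unit.items()}
    return [value - unit_mean[unit] for unit, value in zip(units, values)]

# Hypothetical panel: two units observed for three periods each.
units = ["A", "A", "A", "B", "B", "B"]
y = [1.0, 2.0, 3.0, 10.0, 10.0, 13.0]
y_demeaned = time_demean(units, y)
print(y_demeaned)
```

Any variable that is constant over time within a unit, including the unobserved effect, becomes zero after this transformation.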

Forecast Error: The difference between the actual outcome 
and the forecast of the outcome. 

Forecast Interval: In forecasting, a confidence interval for a 
yet unrealized future value of a time series variable. (See 
also prediction interval.) 

Frisch-Waugh Theorem: The general algebraic result that 
provides multiple regression analysis with its “partialling 
out” interpretation. 
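
The partialling-out result can be checked numerically for a regression with an intercept and two regressors. In this sketch the helper names and data are hypothetical, and the multiple regression slope comes from the standard closed-form solution for the two-regressor case.

```python
def demean(v):
    m = sum(v) / len(v)
    return [a - m for a in v]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def slope_multiple(y, x1, x2):
    """OLS coefficient on x1 from regressing y on an intercept, x1, and x2."""
    yd, d1, d2 = demean(y), demean(x1), demean(x2)
    s11, s22, s12 = dot(d1, d1), dot(d2, d2), dot(d1, d2)
    return (s22 * dot(d1, yd) - s12 * dot(d2, yd)) / (s11 * s22 - s12 ** 2)

def slope_partialled_out(y, x1, x2):
    """Same coefficient in two steps: regress x1 on an intercept and x2,
    keep the residuals r1, then regress y on r1."""
    d1, d2 = demean(x1), demean(x2)
    gamma = dot(d1, d2) / dot(d2, d2)
    r1 = [a - gamma * b for a, b in zip(d1, d2)]
    return dot(r1, y) / dot(r1, r1)

# Hypothetical data, for illustration only.
y = [3.0, 4.0, 8.0, 9.0, 12.0]
x1 = [1.0, 2.0, 3.0, 4.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 5.0]
b_multiple = slope_multiple(y, x1, x2)
b_partial = slope_partialled_out(y, x1, x2)
print(b_multiple, b_partial)
```

The two numbers agree up to rounding error, which is the content of the theorem in this simple case.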

Functional Form Misspecification: A problem that occurs 
when a model has omitted functions of the explanatory vari- 
ables (such as quadratics) or uses the wrong functions of 
either the dependent variable or some explanatory variables. 


G 


Gauss-Markov Assumptions: The set of assumptions 
(Assumptions MLR.1 through MLR.5 or TS.1 through 
TS.5) under which OLS is BLUE. 

Gauss-Markov Theorem: The theorem that states that, under 
the five Gauss-Markov assumptions (for cross-sectional or 
time series models), the OLS estimator is BLUE (conditional 
on the sample values of the explanatory variables). 

Generalized Least Squares (GLS) Estimator: An estimator 
that accounts for a known structure of the error variance 
(heteroskedasticity), serial correlation pattern in the errors, 
or both, via a transformation of the original model. 

Geometric (or Koyck) Distributed Lag: An infinite distributed 
lag model where the lag coefficients decline at a geometric 
rate. 

Granger Causality: A limited notion of causality where past 
values of one series (x_t) are useful for predicting future 
values of another series (y_t), after past values of y_t have 
been controlled for. 

Group-Specific: Time trends in panel data that are allowed 
to vary by group (as opposed to imposing a common time 
trend for all observations). 

Growth Rate: The proportionate change in a time series from 
the previous period. It may be approximated as the differ- 
ence in logs or reported in percentage form. 
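
A short sketch comparing the exact proportionate change with the log-difference approximation. The function names and the values are hypothetical.

```python
import math

def growth_rate(current, previous):
    """Proportionate change from the previous period."""
    return (current - previous) / previous

def log_growth(current, previous):
    """Difference in logs; close to the growth rate when the change is small."""
    return math.log(current) - math.log(previous)

# Hypothetical series rising from 100 to 102.
g = growth_rate(102.0, 100.0)      # 0.02, or 2% in percentage form
g_log = log_growth(102.0, 100.0)   # roughly 0.0198
print(g, g_log)
```

The approximation worsens as the change grows: for a move from 100 to 150 the growth rate is 0.5 while the log difference is about 0.405.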


H 


Heckit Method: An econometric procedure used to correct 
for sample selection bias due to incidental truncation or 
some other form of nonrandomly missing data. 

Heterogeneity Bias: The bias in OLS due to omitted hetero- 
geneity (or omitted variables). 

Heterogeneous Trend Model: A panel data model that allows 
the time trend to vary across individual observations. The 
model is estimated in first differences and requires at least 
three time periods of data. 

Heteroskedasticity: The variance of the error term, given the 
explanatory variables, is not constant. 

Heteroskedasticity and Autocorrelation Consistent (HAC) 
Standard Errors: A form of the OLS standard errors that 
is robust to both heteroskedasticity and serial correlation. 

Heteroskedasticity of Unknown Form: Heteroskedastic- 
ity that may depend on the explanatory variables in an 
unknown, arbitrary fashion. 

Heteroskedasticity-Robust F Statistic: An F-type statis- 
tic that is (asymptotically) robust to heteroskedasticity of 
unknown form. 

Heteroskedasticity-Robust LM Statistic: An LM statistic 
that is robust to heteroskedasticity of unknown form. 

Heteroskedasticity-Robust Standard Error: A standard 
error that is (asymptotically) robust to heteroskedasticity of 
unknown form. 

Heteroskedasticity-Robust t Statistic: A t statistic that is 
(asymptotically) robust to heteroskedasticity of unknown 
form. 

Highly Persistent: A time series process where outcomes 
in the distant future are highly correlated with current 
outcomes. 

Homoskedasticity: The errors in a regression model have 
constant variance conditional on the explanatory variables. 

Hypothesis Test: A statistical test of the null, or maintained, 
hypothesis against an alternative hypothesis. 


I 


Idempotent Matrix: A (square) matrix where multiplication 
of the matrix by itself equals itself. 

Identification: A population parameter, or set of parameters, 
can be consistently estimated. 

Identified Equation: An equation whose parameters can be 
consistently estimated, especially in models with endog- 
enous explanatory variables. 

Identity Matrix: A square matrix where all diagonal ele- 
ments are one and all off-diagonal elements are zero. 

Idiosyncratic Error: In panel data models, the error that 
changes over time as well as across units (say, individuals, 
firms, or cities). 

Ignorable Assignment: See conditional independence. 

Impact Elasticity: In a distributed lag model, the immediate 
percentage change in the dependent variable given a 1% 
increase in the independent variable. 

Impact Multiplier: See impact propensity. 

Impact Propensity: In a distributed lag model, the immediate 
change in the dependent variable given a one-unit increase 
in the independent variable. 

Incidental Truncation: A sample selection problem whereby 
one variable, usually the dependent variable, is only 
observed for certain outcomes of another variable. 

Inclusion of an Irrelevant Variable: Including, when 
estimating an equation by OLS, an explanatory variable that 
has a zero population parameter. 

Inconsistency: The difference between the probability limit 
of an estimator and the parameter value. 

Inconsistent: Describes an estimator that does not converge 
(in probability) to the correct population parameter as the 
sample size grows. 

Independent Random Variables: Random variables 
whose joint distribution is the product of the marginal 
distributions. 

Independent Variable: See explanatory variable. 

Independently Pooled Cross Section: A data set obtained by 
pooling independent random samples from different points 
in time. 

Index Number: A statistic that aggregates information on 
economic activity, such as production or prices. 

Infinite Distributed Lag (IDL) Model: A distributed lag 
model where a change in the explanatory variable can have 
an impact on the dependent variable into the indefinite future. 

Influential Observations: See outliers. 

Information Set: In forecasting, the set of variables that we 
can observe prior to forming our forecast. 

In-Sample Criteria: Criteria for choosing forecasting models 
that are based on goodness-of-fit within the sample used to 
obtain the parameter estimates. 


Instrument: See instrumental variable. 

Instrument Exogeneity: In instrumental variables estima- 
tion, the requirement that an instrumental variable is uncor- 
related with the error term. 

Instrument Relevance: In instrumental variables estimation, 
the requirement that an instrumental variable helps to partially 
explain variation in the endogenous explanatory variable. 

Instrumental Variable: In an equation with an endogenous 
explanatory variable, an IV is a variable that does not 
appear in the equation, is uncorrelated with the error in the 
equation, and is (partially) correlated with the endogenous 
explanatory variable. 

Instrumental Variables (IV) Estimator: An estimator in a 
linear model used when instrumental variables are avail- 
able for one or more endogenous explanatory variables. 

Integrated of Order One [I(1)]: A time series process that 
needs to be first-differenced in order to produce an I(0) 
process. 

Integrated of Order Zero [I(0)]: A stationary, weakly 
dependent time series process that, when used in regression 
analysis, satisfies the law of large numbers and the central 
limit theorem. 

Interaction Effect: In multiple regression, the partial effect 
of one explanatory variable depends on the value of a dif- 
ferent explanatory variable. 

Interaction Term: An independent variable in a regression 
model that is the product of two explanatory variables. 

Intercept: In the equation of a line, the value of the y variable 
when the x variable is zero. 

Intercept Parameter: The parameter in a multiple lin- 
ear regression model that gives the expected value of the 
dependent variable when all the independent variables 
equal zero. 

Intercept Shift: The intercept in a regression model differs 
by group or time period. 

Internet: A global computer network that can be used to 
access information and download databases. 

Interval Estimator: A rule that uses data to obtain lower and 
upper bounds for a population parameter. (See also confi- 
dence interval.) 

Inverse: For an n x n matrix, its inverse (if it exists) is the 
n x n matrix for which pre- and post-multiplication by the 
original matrix yields the identity matrix. 

Inverse Mills Ratio: A term that can be added to a multiple 
regression model to remove sample selection bias. 


J 


Joint Distribution: The probability distribution determining 
the probabilities of outcomes involving two or more ran- 
dom variables. 

Joint Hypotheses Test: A test involving more than one 
restriction on the parameters in a model. 

Jointly Insignificant: Failure to reject, using an F test at a 
specified significance level, that all coefficients for a group 
of explanatory variables are zero. 

Jointly Statistically Significant: The null hypothesis that 
two or more explanatory variables have zero population 
coefficients is rejected at the chosen significance level. 

Just Identified Equation: For models with endogenous 
explanatory variables, an equation that is identified but 
would not be identified with one fewer instrumental 
variable. 


K 


Kurtosis: A measure of the thickness of the tails of a distribu- 
tion based on the fourth moment of the standardized ran- 
dom variable; the measure is usually compared to the value 
for the standard normal distribution, which is three. 
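
A minimal sketch of the sample analogue, assuming the convention that divides by n when computing the moments (other conventions exist). The function name and the data are hypothetical.

```python
def sample_kurtosis(data):
    """Average fourth power of the standardized observations."""
    n = len(data)
    mean = sum(data) / n
    var = sum((x - mean) ** 2 for x in data) / n
    fourth = sum((x - mean) ** 4 for x in data) / n
    return fourth / var ** 2

# Hypothetical two-point data: every observation is exactly one standard
# deviation from the mean, so the tails are as thin as possible.
k = sample_kurtosis([-1.0, 1.0, -1.0, 1.0])
print(k)
```

The result of 1 is well below the normal benchmark of three; heavy-tailed data give values above three.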


L 


Lag Distribution: In a finite or infinite distributed lag model, 
the lag coefficients graphed as a function of the lag length. 

Lagged Dependent Variable: An explanatory variable that 
is equal to the dependent variable from an earlier time 
period. 

Lagged Endogenous Variable: In a simultaneous equations 
model, a lagged value of one of the endogenous variables. 

Lagrange Multiplier (LM) Statistic: A test statistic with 
large-sample justification that can be used to test for omit- 
ted variables, heteroskedasticity, and serial correlation, 
among other model specification problems. 

Large Sample Properties: See asymptotic properties. 

Latent Variable Model: A model where the observed depen- 
dent variable is assumed to be a function of an underlying 
latent, or unobserved, variable. 

Law of Iterated Expectations: A result from probability that 
relates unconditional and conditional expectations. 

Law of Large Numbers (LLN): A theorem that says that the 
average from a random sample converges in probability to 
the population average; the LLN also holds for stationary 
and weakly dependent time series. 

Leads and Lags Estimator: An estimator of a cointegrating 
parameter in a regression with I(1) variables, where the 
current, some past, and some future first differences in the 
explanatory variable are included as regressors. 

Least Absolute Deviations (LAD): A method for estimat- 
ing the parameters of a multiple regression model based 
on minimizing the sum of the absolute values of the 
residuals. 

Least Squares Estimator: An estimator that minimizes a 
sum of squared residuals. 

Likelihood Ratio Statistic: A statistic that can be used to 
test single or multiple hypotheses when the constrained 
and unconstrained models have been estimated by maxi- 
mum likelihood. The statistic is twice the difference in the 
unconstrained and constrained log-likelihoods. 

Limited Dependent Variable (LDV): A dependent or 
response variable whose range is restricted in some impor- 
tant way. 

Linear Function: A function where the change in the depen- 
dent variable, given a one-unit change in an independent 
variable, is constant. 

Linear Probability Model (LPM): A binary response model 
where the response probability is linear in its parameters. 

Linear Time Trend: A trend that is a linear function of time. 

Linearly Independent Vectors: A set of vectors such that no 
vector can be written as a linear combination of the others 
in the set. 

Log Function: A mathematical function, defined only for strictly 
positive arguments, with a positive but decreasing slope. 

Logit Model: A model for binary response where the 
response probability is the logit function evaluated at a lin- 
ear function of the explanatory variables. 

Log-Likelihood Function: The sum of the log-likelihoods, 
where the log-likelihood for each observation is the log of 
the density of the dependent variable given the explanatory 
variables; the log-likelihood function is viewed as a func- 
tion of the parameters to be estimated. 

Longitudinal Data: See panel data. 

Long-Run Elasticity: The long-run propensity in a distrib- 
uted lag model with the dependent and independent vari- 
ables in logarithmic form; thus, the long-run elasticity is 
the eventual percentage increase in the explained variable, 
given a permanent 1% increase in the explanatory variable. 

Long-Run Multiplier: See long-run propensity. 

Long-Run Propensity (LRP): In a distributed lag model, the 
eventual change in the dependent variable given a perma- 
nent, one-unit increase in the independent variable. 
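
In a finite distributed lag model the long-run propensity is the sum of the coefficients on the current and lagged explanatory variable. The sketch below uses hypothetical coefficients to contrast it with the impact propensity.

```python
def long_run_propensity(lag_coefficients):
    """Sum of the coefficients on x_t, x_{t-1}, ..., x_{t-q}."""
    return sum(lag_coefficients)

# Hypothetical FDL coefficients on x_t, x_{t-1}, and x_{t-2}.
deltas = [0.5, 0.25, 0.125]
impact_propensity = deltas[0]      # immediate effect of a one-unit increase
lrp = long_run_propensity(deltas)  # eventual effect of a permanent increase
print(impact_propensity, lrp)
```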

Loss Function: A function that measures the loss when a 
forecast differs from the actual outcome; the most common 
examples are absolute value loss and squared loss. 


M 


Marginal Effect: The effect on the dependent variable that 
results from changing an independent variable by a small 
amount. 

Martingale: A time series process whose expected value, 
given all past outcomes on the series, simply equals the 
most recent value. 

Martingale Difference Sequence: The first difference of a 
martingale. It is unpredictable (or has a zero mean), given 
past values of the sequence. 


Matched Pair Sample: A sample where each observation is 
matched with another, as in a sample consisting of a hus- 
band and wife or a set of two siblings. 

Matrix: An array of numbers. 

Matrix Multiplication: An algorithm for multiplying 
together two conformable matrices. 

Matrix Notation: A convenient mathematical notation, 
grounded in matrix algebra, for expressing and manipulating 
the multiple regression model. 

Maximum Likelihood Estimation (MLE): A broadly applicable 
estimation method where the parameter estimates are 
chosen to maximize the log-likelihood function. 

Maximum Likelihood Estimator: An estimator that maximizes 
the (log of the) likelihood function. 

Mean Absolute Error (MAE): A performance measure in 
forecasting, computed as the average of the absolute values 
of the forecast errors. 

Mean Independent: The key requirement in multiple regres- 
sion analysis, which says the unobserved error has a mean 
that does not change across subsets of the population 
defined by different values of the explanatory variables. 

Mean Squared Error (MSE): The expected squared distance 
that an estimator is from the population value; it equals the 
variance plus the square of any bias. 

Measurement Error: The difference between an observed 
variable and the variable that belongs in a multiple regression 
equation. 

Median: In a probability distribution, it is the value where 
there is a 50% chance of being below the value and a 50% 
chance of being above it. In a sample of numbers, it is the 
middle value after the numbers have been ordered. 

Micronumerosity: A term introduced by Arthur Goldberger 
to describe properties of econometric estimators with small 
sample sizes. 

Minimum Variance Unbiased Estimator: An estimator 
with the smallest variance in the class of all unbiased 
estimators. 

Missing at Random: In multiple regression analysis, a miss- 
ing data mechanism where the reason data are missing may 
be correlated with the explanatory variables but is indepen- 
dent of the error term. 

Missing Completely at Random (MCAR): In multiple 
regression analysis, a missing data mechanism where the 
reason data are missing is statistically independent of the 
values of the explanatory variables as well as the unob- 
served error. 

Missing Data: A data problem that occurs when we do not 
observe values on some variables for certain observations 
(individuals, cities, time periods, and so on) in the sample. 

Missing Indicator Method: A method for dealing with missing 
observations in an explanatory variable. The explanatory 
variable is included alongside a binary variable equal 
to 0 when the explanatory variable is missing for that 
observation, allowing us to use the full data set. 


Misspecification Analysis: The process of determining likely 
biases that can arise from omitted variables, measurement 
error, simultaneity, and other kinds of model misspecification. 

Moving Average Process of Order One [MA(1)]: A time 
series process generated as a linear function of the current 
value and one lagged value of a zero-mean, constant vari- 
ance, uncorrelated stochastic process. 

Multicollinearity: A term that refers to correlation among 
the independent variables in a multiple regression model; it 
is usually invoked when some correlations are “large,” but 
an actual magnitude is not well defined. 

Multiple Hypotheses Test: A test of a null hypothesis involving 
more than one restriction on the parameters. 

Multiple Linear Regression (MLR) Model: A model linear 
in its parameters, where the dependent variable is a function 
of independent variables plus an error term. 

Multiple Regression Analysis: A type of analysis that is 
used to describe estimation of and inference in the multiple 
linear regression model. 

Multiple Restrictions: More than one restriction on the 
parameters in an econometric model. 

Multiple-Step-Ahead Forecast: A time series forecast of 
more than one period into the future. 

Multiplicative Measurement Error: Measurement error 
where the observed variable is the product of the true unobserved 
variable and a positive measurement error. 

Multivariate Normal Distribution: A distribution for mul- 
tiple random variables where each linear combination of 
the random variables has a univariate (one-dimensional) 
normal distribution. 


N 


n-R-Squared Statistic: See Lagrange multiplier statistic. 

Natural Experiment: A situation where the economic envi- 
ronment—sometimes summarized by an explanatory 
variable—exogenously changes, perhaps inadvertently, due 
to a policy or institutional change. 

Natural Logarithm: See log function. 

Newey-West Standard Errors: A specific form of HAC 
standard errors. In this case, the truncation lag is set to the 
integer part of 4(n/100)^(2/9). 
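
The truncation-lag rule stated in this entry, the integer part of 4 times (n/100) raised to the power 2/9, is easy to evaluate for a given sample size n. The function name below is hypothetical.

```python
def newey_west_lag(n):
    """Integer part of 4 * (n / 100) ** (2 / 9) for sample size n."""
    return int(4 * (n / 100) ** (2 / 9))

lag_100 = newey_west_lag(100)
lag_500 = newey_west_lag(500)
print(lag_100, lag_500)
```

The lag grows slowly with the sample size: it is 4 for n = 100 and 5 for n = 500.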

Nonexperimental Data: Data that have not been obtained 
through a controlled experiment. 

Nonlinear Function: A function whose slope is not constant. 

Nonnested Models: Two (or more) models where no model 
can be written as a special case of the other by imposing 
restrictions on the parameters. 

Nonrandom Sample: A sample obtained other than by sam- 
pling randomly from the population of interest. 

Nonrandom Sample Selection: When the sample is not ran- 
domly drawn from the population, but is selected on the 
basis of individual characteristics. 

Nonstationary Process: A time series process whose joint 
distributions are not constant across different epochs. 

Normal Distribution: A probability distribution commonly 
used in statistics and econometrics for modeling a population. 
Its probability density function has a bell shape. 

Normality Assumption: The classical linear model assumption 
which states that the error (or dependent variable) has a 
normal distribution, conditional on the explanatory variables. 

Null Hypothesis: In classical hypothesis testing, we take this 
hypothesis as true and require the data to provide substantial 
evidence against it. 

Numerator Degrees of Freedom: In an F test, the number of 
restrictions being tested. 


O 


Observational Data: See nonexperimental data. 

OLS: See ordinary least squares. 

OLS Intercept Estimate: The intercept in an OLS regression 
line. 

OLS Regression Line: The equation relating the predicted 
value of the dependent variable to the independent vari- 
ables, where the parameter estimates have been obtained 
by OLS. 

OLS Slope Estimate: A slope in an OLS regression line. 

Omitted Variable Bias: The bias that arises in the OLS 
estimators when a relevant variable is omitted from the 
regression. 

Omitted Variables: One or more variables, which we would 
like to control for, have been omitted in estimating a regres- 
sion model. 

One-Sided Alternative: An alternative hypothesis that states 
that the parameter is greater than (or less than) the value 
hypothesized under the null. 

One-Step-Ahead Forecast: A time series forecast one period 
into the future. 

One-Tailed Test: A hypothesis test against a one-sided 
alternative. 

Online Databases: Databases that can be accessed via a com- 
puter network. 

Online Search Services: Computer software that allows the 
Internet or databases on the Internet to be searched by 
topic, name, title, or keywords. 

Order Condition: A necessary condition for identifying the 
parameters in a model with one or more endogenous explana- 
tory variables: the total number of exogenous variables must 
be at least as great as the total number of explanatory variables. 

Ordinal Variable: A variable where the ordering of the val- 
ues conveys information but the magnitude of the values 
does not. 

Ordinary Least Squares (OLS): A method for estimating 
the parameters of a multiple linear regression model. The 
ordinary least squares estimates are obtained by minimiz- 
ing the sum of squared residuals. 

Outliers: Observations in a data set that are substantially 
different from the bulk of the data, perhaps because of errors 
or because some data are generated by a different model 
than most of the other data. 

Out-of-Sample Criteria: Criteria used for choosing forecast- 
ing models which are based on a part of the sample that 
was not used in obtaining parameter estimates. 

Over Controlling: In a multiple regression model, including 
explanatory variables that should not be held fixed when 
studying the ceteris paribus effect of one or more other 
explanatory variables; this can occur when variables that 
are themselves outcomes of an intervention or a policy are 
included among the regressors. 

Overall Significance of a Regression: A test of the joint sig- 
nificance of all explanatory variables appearing in a mul- 
tiple regression equation. 

Overdispersion: In modeling a count variable, the variance is 
larger than the mean. 

Overidentified Equation: In models with endogenous 
explanatory variables, an equation where the number of 
instrumental variables is strictly greater than the number of 
endogenous explanatory variables. 

Overidentifying Restrictions: The extra moment conditions 
that come from having more instrumental variables than 
endogenous explanatory variables in a linear model. 

Overspecifying the Model: See inclusion of an irrelevant 
variable. 


P 


p-Value: The smallest significance level at which the null 
hypothesis can be rejected. Equivalently, the largest signifi- 
cance level at which the null hypothesis cannot be rejected. 

Pairwise Uncorrelated Random Variables: A set of two or 
more random variables where each pair is uncorrelated. 

Panel Data: A data set constructed from repeated cross sec- 
tions over time. With a balanced panel, the same units 
appear in each time period. With an unbalanced panel, some 
units do not appear in each time period, often due to attrition. 

Parallel Trends Assumption: The assumption that, in the 
absence of the treatment, the outcome variable would follow 
the same trend (in both rate and direction) in the treatment 
and control groups. 

Partial Derivative: For a smooth function of more than one 
variable, the slope of the function in one direction. 

Partial Effect: The effect of an explanatory variable on the 
dependent variable, holding other factors in the regression 
model fixed. 

Partial Effect at the Average (PEA): In models with non- 
constant partial effects, the partial effect evaluated at the 
average values of the explanatory variables. 


Percent Correctly Predicted: In a binary response model, 
the percentage of times the prediction of zero or one coin- 
cides with the actual outcome. 

Percentage Change: The proportionate change in a variable, 
multiplied by 100. 

Percentage Point Change: The change in a variable that is 
measured as a percentage. 
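
The difference between a percentage change and a percentage point change is easy to blur, so a small sketch may help. The function names and the unemployment-rate figures are hypothetical, chosen only for illustration.

```python
def percentage_change(old: float, new: float) -> float:
    """Proportionate change times 100; 'old' must be nonzero."""
    return 100.0 * (new - old) / old


def percentage_point_change(old_pct: float, new_pct: float) -> float:
    """Simple difference between two values already measured in percent."""
    return new_pct - old_pct


# Hypothetical example: an unemployment rate rising from 8% to 10%.
print(percentage_point_change(8.0, 10.0))  # 2.0 percentage points
print(percentage_change(8.0, 10.0))        # 25.0 percent
```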

Perfect Collinearity: In multiple regression, one independent 
variable is an exact linear function of one or more other 
independent variables. 

Plug-In Solution to the Omitted Variables Problem: A 
proxy variable is substituted for an unobserved omitted 
variable in an OLS regression. 

Point Forecast: The forecasted value of a future outcome. 

Poisson Distribution: A probability distribution for count 
variables. 

Poisson Regression Model: A model for a count dependent 
variable where the dependent variable, conditional on the 
explanatory variables, is nominally assumed to have a Pois- 
son distribution. 

Policy Analysis: An empirical analysis that uses econometric 
methods to evaluate the effects of a certain policy. 

Pooled Cross Section: A data configuration where indepen- 
dent cross sections, usually collected at different points in 
time, are combined to produce a single data set. 

Population Model: A model, especially a multiple linear 
regression model, that describes a population. 

Population R-Squared: In the population, the fraction of the 
variation in the dependent variable that is explained by the 
explanatory variables. 

Population Regression Function: See conditional expectation. 

Positive Definite: A symmetric matrix such that all quadratic 
forms, except the trivial one that must be zero, are strictly 
positive. 

Positive Semi-Definite: A symmetric matrix such that all 
quadratic forms are nonnegative. 

Power of a Test: The probability of rejecting the null hypoth- 
esis when it is false; the power depends on the values of the 
population parameters under the alternative. 

Practical Significance: The practical or economic impor- 
tance of an estimate, which is measured by its sign and 
magnitude, as opposed to its statistical significance. 

Prais-Winsten (PW) Estimation: A method of estimating 
a multiple linear regression model with AR(1) errors and 
strictly exogenous explanatory variables; unlike Cochrane- 
Orcutt, Prais-Winsten uses the equation for the first time 
period in estimation. 

Predetermined Variable: In a simultaneous equations model, 
either a lagged endogenous variable or a lagged exogenous 
variable. 

Predicted Variable: See dependent variable. 


Prediction: The estimate of an outcome obtained by plugging 
specific values of the explanatory variables into an esti- 
mated model, usually a multiple regression model. 

Prediction Error: The difference between the actual out- 
come and a prediction of that outcome. 

Prediction Interval: A confidence interval for an unknown 
outcome on a dependent variable in a multiple regression 
model. 

Predictor Variable: See explanatory variable. 

Probability Density Function (pdf): A function that, for dis- 
crete random variables, gives the probability that the ran- 
dom variable takes on each value; for continuous random 
variables, the area under the pdf gives the probability of 
various events. 

Probability Limit: The value to which an estimator con- 
verges as the sample size grows without bound. 

Probit Model: A model for binary responses where the 
response probability is the standard normal cdf evaluated at 
a linear function of the explanatory variables. 

Program Evaluation: An analysis of a particular private or 
public program using econometric methods to obtain the 
causal effect of the program. 

Proportionate Change: The change in a variable relative to 
its initial value; mathematically, the change divided by the 
initial value. 

Proxy Variable: An observed variable that is related but not 
identical to an unobserved explanatory variable in multiple 
regression analysis. 

Pseudo R-Squared: Any number of goodness-of-fit measures 
for limited dependent variable models. 


Q 


Quadratic Form: A mathematical function where the vector 
argument both pre- and post-multiplies a square, symmet- 
ric matrix. 

Quadratic Functions: Functions that contain squares of one 
or more explanatory variables; they capture diminishing or 
increasing effects on the dependent variable. 

Quasi-Demeaned Data: In random effects estimation for 
panel data, it is the original data in each time period minus 
a fraction of the time average; these calculations are done 
for each cross-sectional observation. 

Quasi-Differenced Data: In estimating a regression model with 
AR(1) serial correlation, it is the difference between the cur- 
rent time period and a multiple of the previous time period, 
where the multiple is the parameter in the AR(1) model. 
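
A minimal sketch of quasi-differencing follows. It treats the AR(1) parameter rho as known, which is an assumption for illustration; in practice it is typically estimated. The function name and sample numbers are hypothetical. The first observation is dropped here because it has no predecessor.

```python
from typing import List


def quasi_difference(series: List[float], rho: float) -> List[float]:
    """Return y_t - rho * y_{t-1} for t = 2, ..., n (first period dropped)."""
    return [series[t] - rho * series[t - 1] for t in range(1, len(series))]


y = [1.0, 2.0, 4.0]  # hypothetical time series
print(quasi_difference(y, rho=0.5))  # [1.5, 3.0]
```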

Quasi-Experiment: See natural experiment. 

Quasi-Likelihood Ratio Statistic: A modification of the
likelihood ratio statistic that accounts for possible
distributional misspecification, as in a Poisson regression model.

Quasi-Maximum Likelihood Estimation (QMLE): Maximum
likelihood estimation where the log-likelihood
function may not correspond to the actual conditional
distribution of the dependent variable.


R


Random Assignment: The process by which observations
are assigned to the treatment and control groups completely
at random (i.e., not as a function of any observable
characteristics).

Randomized Controlled Trial (RCT): An experimental 
design in which a treatment group (given a policy) and 
a control group (no policy) are randomly selected from 
the population. Assuming no pre-treatment differences 
between these groups, any observed differences should be a 
result of the policy given to the treatment group. 

Regression Adjustment: A method for dealing with non-random
assignment that involves additional control
variables. The inclusion of these variables allows for
identification of the causal effect of a policy.

R-Squared: In a multiple regression model, the proportion of
the total sample variation in the dependent variable that is
explained by the independent variables.

R-Squared Form of the F Statistic: The F statistic for testing 
exclusion restrictions expressed in terms of the R-squareds 
from the restricted and unrestricted models. 
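
As a sketch, the R-squared form can be written as F = [(R2_ur - R2_r)/q] / [(1 - R2_ur)/df_ur], where q is the number of exclusion restrictions and df_ur is the degrees of freedom in the unrestricted model. The function name and the input values below are hypothetical, and the formula presumes both models have the same dependent variable.

```python
def f_stat_from_r2(r2_ur: float, r2_r: float, q: int, df_ur: int) -> float:
    """F statistic for q exclusion restrictions, from the two R-squareds.

    r2_ur: R-squared of the unrestricted model
    r2_r:  R-squared of the restricted model
    df_ur: degrees of freedom of the unrestricted model (n - k - 1)
    """
    numerator = (r2_ur - r2_r) / q
    denominator = (1.0 - r2_ur) / df_ur
    return numerator / denominator


# Hypothetical inputs, for illustration only.
print(f_stat_from_r2(r2_ur=0.5, r2_r=0.4, q=2, df_ur=100))
```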

Random Coefficient (Slope) Model: A multiple regression 
model where the slope parameters are allowed to depend 
on unobserved unit-specific variables. 

Random Effects Estimator: A feasible GLS estimator in the 
unobserved effects model where the unobserved effect is 
assumed to be uncorrelated with the explanatory variables 
in each time period. 

Random Effects Model: The unobserved effects panel 
data model where the unobserved effect is assumed to be 
uncorrelated with the explanatory variables in each time 
period. 

Random Sample: A sample obtained by sampling randomly 
from the specified population. 

Random Sampling: A sampling scheme whereby each 
observation is drawn at random from the population. 
In particular, no unit is more likely to be selected than any 
other unit, and each draw is independent of all other draws. 

Random Variable: A variable whose outcome is uncertain. 

Random Vector: A vector consisting of random variables. 

Random Walk: A time series process where next period’s 
value is obtained as this period’s value, plus an indepen- 
dent (or at least an uncorrelated) error term. 
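
The recursion behind a random walk can be sketched as follows. The shocks are supplied by the caller so the example is deterministic, and the optional drift term is added in each period. The function name, starting value, and shock values are hypothetical.

```python
from typing import List


def random_walk(shocks: List[float], drift: float = 0.0, start: float = 0.0) -> List[float]:
    """Build y_t = drift + y_{t-1} + e_t from a list of shocks e_t."""
    path = [start]
    for e in shocks:
        # Next period's value is this period's value plus drift plus the shock.
        path.append(path[-1] + drift + e)
    return path


print(random_walk([1.0, -1.0, 2.0]))             # [0.0, 1.0, 0.0, 2.0]
print(random_walk([1.0, -1.0, 2.0], drift=0.5))  # [0.0, 1.5, 1.0, 3.5]
```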

Random Walk with Drift: A random walk that has a constant
(or drift) added in each period.

Rank Condition: A sufficient condition for identification
of a model with one or more endogenous explanatory
variables.

Rank of a Matrix: The number of linearly independent col- 
umns in a matrix. 

Rational Distributed Lag (RDL) Model: A type of infinite 
distributed lag model where the lag distribution depends on 
relatively few parameters. 

Reduced Form Equation: A linear equation where an 
endogenous variable is a function of exogenous variables 
and unobserved errors. 

Reduced Form Error: The error term appearing in a reduced 
form equation. 

Reduced Form Parameters: The parameters appearing in a 
reduced form equation. 

Regressand: See dependent variable. 

Regression Specification Error Test (RESET): A general 
test for functional form in a multiple regression model; it 
is an F test of joint significance of the squares, cubes, and 
perhaps higher powers of the fitted values from the initial 
OLS estimation. 

Regression through the Origin: Regression analysis where 
the intercept is set to zero; the slopes are obtained by mini- 
mizing the sum of squared residuals, as usual. 

Regressor: See explanatory variable. 

Rejection Region: The set of values of a test statistic that 
leads to rejecting the null hypothesis. 

Rejection Rule: In hypothesis testing, the rule that deter- 
mines when the null hypothesis is rejected in favor of the 
alternative hypothesis. 

Relative Change: See proportionate change. 

Resampling Method: A technique for approximating stan- 
dard errors (and distributions of test statistics) whereby a 
series of samples are obtained from the original data set 
and estimates are computed for each subsample. 

Residual: The difference between the actual value and the fit- 
ted (or predicted) value; there is a residual for each obser- 
vation in the sample used to obtain an OLS regression 
line. 

Residual Analysis: A type of analysis that studies the sign 
and size of residuals for particular observations after a mul- 
tiple regression model has been estimated. 

Residual Sum of Squares: See sum of squared residuals. 

Response Probability: In a binary response model, the prob- 
ability that the dependent variable takes on the value one, 
conditional on explanatory variables. 

Response Variable: See dependent variable. 

Restricted Model: In hypothesis testing, the model obtained 
after imposing all of the restrictions required under the 
null. 

Retrospective Data: Data collected based on past, rather than 
current, information. 


Root Mean Squared Error (RMSE): Another name for 
the standard error of the regression in multiple regression 
analysis. 

Row Vector: A vector of numbers arranged as a row. 


S 


Sample Average: The sum of n numbers divided by n; a mea- 
sure of central tendency. 

Sample Correlation Coefficient: An estimate of the (popula- 
tion) correlation coefficient from a sample of data. 

Sample Covariance: An unbiased estimator of the popula- 
tion covariance between two random variables. 

Sample Regression Function (SRF): See OLS regression 
line. 

Sample Standard Deviation: A consistent estimator of the 
population standard deviation. 

Sample Variance: An unbiased, consistent estimator of the 
population variance. 

Sampling Distribution: The probability distribution of an 
estimator over all possible sample outcomes. 

Sampling Standard Deviation: The standard deviation of 
an estimator, that is, the standard deviation of a sampling 
distribution. 

Sampling Variance: The variance in the sampling distribu- 
tion of an estimator; it measures the spread in the sampling 
distribution. 

Scalar Multiplication: The algorithm for multiplying a sca- 
lar (number) by a vector or matrix. 

Scalar Variance-Covariance Matrix: A variance-covariance 
matrix where all off-diagonal terms are zero and the diago- 
nal terms are the same positive constant. 

Score Statistic: See Lagrange multiplier statistic. 

Seasonal Dummy Variables: A set of dummy variables used 
to denote the quarters or months of the year. 

Seasonality: A feature of monthly or quarterly time series 
where the average value differs systematically by season of 
the year. 

Seasonally Adjusted: Monthly or quarterly time series data 
where some statistical procedure—possibly regression on 
seasonal dummy variables—has been used to remove the 
seasonal component. 

Selected Sample: A sample of data obtained not by random 
sampling but by selecting on the basis of some observed or 
unobserved characteristic. 

Self-Selection: Deciding on an action based on the likely 
benefits, or costs, of taking that action. 

Self-Selection Problem: Occurs when there is non-random 
assignment and inclusion in the treatment and control 
group systematically depends on individual characteristics. 

Semi-Elasticity: The percentage change in the dependent vari- 
able given a one-unit increase in an independent variable. 


Sensitivity Analysis: The process of checking whether 
the estimated effects and statistical significance of key 
explanatory variables are sensitive to inclusion of other 
explanatory variables, functional form, dropping of 
potentially outlying observations, or different methods of 
estimation. 

Sequentially Exogenous: A feature of an explanatory vari- 
able in time series (or panel data) models where the error 
term in the current time period has a zero mean conditional 
on all current and past explanatory variables; a weaker ver- 
sion is stated in terms of zero correlations. 

Serial Correlation: In a time series or panel data model, cor- 
relation between the errors in different time periods. 

Serial Correlation-Robust Standard Error: A standard 
error for an estimator that is (asymptotically) valid whether 
or not the errors in the model are serially correlated. 

Serially Uncorrelated: The errors in a time series or panel 
data model are pairwise uncorrelated across time. 

Short-Run Elasticity: The impact propensity in a distributed 
lag model when the dependent and independent variables 
are in logarithmic form. 

Significance Level: The probability of a Type I error in 
hypothesis testing. 

Simple Linear Regression Model: A model where the 
dependent variable is a linear function of a single indepen- 
dent variable, plus an error term. 

Simultaneity: A term that means at least one explanatory 
variable in a multiple linear regression model is determined 
jointly with the dependent variable. 

Simultaneity Bias: The bias that arises from using OLS to
estimate an equation in a simultaneous equations model.

Simultaneous Equations Model (SEM): A model that
jointly determines two or more endogenous variables,
where each endogenous variable can be a function of other
endogenous variables as well as of exogenous variables
and an error term.

Skewness: A measure of how far a distribution is from being 
symmetric, based on the third moment of the standardized 
random variable. 

Slope Parameter: The coefficient on an independent variable 
in a multiple regression model. 

Smearing Estimate: A retransformation method particularly 
useful for predicting the level of a response variable when 
a linear model has been estimated for the natural log of the 
response variable. 
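
A minimal sketch of a smearing-type retransformation follows, assuming the usual form in which the exponentiated fitted value of log(y) is scaled by the sample average of the exponentiated OLS residuals. The function name and the inputs are hypothetical; the residuals would come from the estimated log model.

```python
import math
from typing import List


def smearing_prediction(log_fitted: float, residuals: List[float]) -> float:
    """Predict the level of y from a fitted value of log(y).

    The scale factor is the average of exp(residual) over the sample.
    """
    scale = sum(math.exp(u) for u in residuals) / len(residuals)
    return scale * math.exp(log_fitted)


# Hypothetical residuals from a log(y) regression and one fitted value.
resid = [-0.5, 0.5, 0.0]
print(smearing_prediction(log_fitted=2.0, residuals=resid))
```

The scale factor exceeds one whenever the residuals are not all zero and average to zero, which is why simply exponentiating the fitted log value tends to underpredict the level.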

Spreadsheet: Computer software used for entering and 
manipulating data. 

Spurious Regression Problem: A problem that arises when 
regression analysis indicates a relationship between two or 
more unrelated time series processes simply because each 
has a trend, is an integrated time series (such as a random 
walk), or both.

Stable AR(1) Process: An AR(1) process where the parameter
on the lag is less than one in absolute value. The correlation
between two random variables in the sequence
declines to zero at a geometric rate as the distance between
the random variables increases, and so a stable AR(1) process
is weakly dependent.

Standard Deviation: A common measure of spread in the 
distribution of a random variable. 

Standard Deviation of $\hat{\beta}_j$: A common measure of spread
in the sampling distribution of $\hat{\beta}_j$.

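Standard Error of $\hat{\beta}_1$: The standard error of the OLS
slope estimator. In the simple regression model, this is
$\mathrm{se}(\hat{\beta}_1) = \hat{\sigma}/\sqrt{\mathrm{SST}_x}$, where $\hat{\sigma}$ is the
standard error of the regression and $\mathrm{SST}_x$ is the total
sample variation in the explanatory variable.

Standard Error of $\hat{\beta}_j$: An estimate of the standard deviation
in the sampling distribution of $\hat{\beta}_j$.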

Standard Error of the Regression (SER): In multiple 
regression analysis, the estimate of the standard devia- 
tion of the population error, obtained as the square 
root of the sum of squared residuals over the degrees of 
freedom. 
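
As a sketch of this calculation, the SER is the square root of SSR/(n - k - 1), where n is the number of observations and k is the number of slope parameters. The function name and the residual values are hypothetical.

```python
import math
from typing import List


def standard_error_of_regression(residuals: List[float], k: int) -> float:
    """Square root of SSR over the degrees of freedom n - k - 1."""
    n = len(residuals)
    ssr = sum(u * u for u in residuals)  # sum of squared residuals
    return math.sqrt(ssr / (n - k - 1))


# Hypothetical OLS residuals from a model with one slope parameter.
resid = [1.0, -1.0, 1.0, -1.0, 0.0]
print(standard_error_of_regression(resid, k=1))
```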

Standardized Coefficients: Regression coefficients that 
measure the standard deviation change in the dependent 
variable given a one standard deviation increase in an inde- 
pendent variable. 

Standardized Random Variable: A random variable trans- 
formed by subtracting off its expected value and dividing 
the result by its standard deviation; the new random vari- 
able has mean zero and standard deviation one. 

Static Model: A time series model where only contempora- 
neous explanatory variables affect the dependent variable. 

Stationary Process: A time series process where the 
marginal and all joint distributions are invariant across 
time. 

Statistical Significance: The importance of an estimate 
as measured by the size of a test statistic, usually a t 
statistic. 

Statistically Insignificant: Failure to reject the null hypoth- 
esis that a population parameter is equal to zero, at the cho- 
sen significance level. 

Statistically Significant: Rejecting the null hypothesis that a 
parameter is equal to zero against the specified alternative, 
at the chosen significance level. 

Stochastic Process: A sequence of random variables indexed 
by time. 

Stratified Sampling: A nonrandom sampling scheme 
whereby the population is first divided into several non- 
overlapping, exhaustive strata, and then random samples 
are taken from within each stratum. 

Strict Exogeneity: An assumption that holds in a time series 
or panel data model when the explanatory variables are 
strictly exogenous.

Strictly Exogenous: A feature of explanatory variables in
a time series or panel data model where the error term at
any time period has zero expectation, conditional on the
explanatory variables in all time periods; a less restrictive
version is stated in terms of zero correlations.

Strongly Dependent: See highly persistent. 

Structural Equation: An equation derived from economic 
theory or from less formal economic reasoning. 

Structural Error: The error term in a structural equation, 
which could be one equation in a simultaneous equations 
model. 

Structural Parameters: The parameters appearing in a struc- 
tural equation. 

Studentized Residuals: The residuals computed by exclud- 
ing each observation, in turn, from the estimation, divided 
by the estimated standard deviation of the error. 

Sum of Squared Residuals (SSR): In multiple regression 
analysis, the sum of the squared OLS residuals across all 
observations. 

Summation Operator: A notation, denoted by $\sum$, used to
define the summing of a set of numbers.

Symmetric Distribution: A probability distribution charac- 
terized by a probability density function that is symmet- 
ric around its median value, which must also be the mean 
value (whenever the mean exists). 

Symmetric Matrix: A (square) matrix that equals its transpose. 


T 


t Distribution: The distribution of the ratio of a standard normal
random variable and the square root of an independent
chi-square random variable, where the chi-square random
variable is first divided by its df.

t Ratio: See t statistic. 

t Statistic: The statistic used to test a single hypothesis about 
the parameters in an econometric model. 

Test Statistic: A rule used for testing hypotheses where each 
sample outcome produces a numerical value. 

Text Editor: Computer software that can be used to edit text 
files. 

Text (ASCII) File: A universal file format that can be trans- 
ported across numerous computer platforms. 

Time-Demeaned Data: Panel data where, for each cross- 
sectional unit, the average over time is subtracted from the 
data in each time period. 
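
The time-demeaning step can be sketched for a small panel stored as (unit, value) pairs. The data layout, function name, and numbers are hypothetical, chosen only to show the calculation.

```python
from collections import defaultdict
from typing import Hashable, List, Tuple


def time_demean(rows: List[Tuple[Hashable, float]]) -> List[Tuple[Hashable, float]]:
    """Subtract each unit's average over time from its observations."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for unit, value in rows:
        totals[unit] += value
        counts[unit] += 1
    means = {unit: totals[unit] / counts[unit] for unit in totals}
    return [(unit, value - means[unit]) for unit, value in rows]


# Hypothetical panel: two units observed in two periods each.
panel = [("a", 1.0), ("a", 3.0), ("b", 10.0), ("b", 20.0)]
print(time_demean(panel))
# [('a', -1.0), ('a', 1.0), ('b', -5.0), ('b', 5.0)]
```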

Time Series Data: Data collected over time on one or more 
variables. 

Time Series Process: See stochastic process. 

Time Trend: A function of time that is the expected value of 
a trending time series process. 


Tobit Model: A model for a dependent variable that takes on 
the value zero with positive probability but is roughly con- 
tinuously distributed over strictly positive values. (See also 
corner solution response.) 

Top Coding: A form of data censoring where the value of a 
variable is not reported when it is above a given threshold; 
we only know that it is at least as large as the threshold. 

Total Sum of Squares (SST): The total sample variation in a 
dependent variable about its sample average. 

Trace of a Matrix: For a square matrix, the sum of its diago- 
nal elements. 

Transpose: For any matrix, the new matrix obtained by inter- 
changing its rows and columns. 

Treatment Group: In program evaluation, the group that par- 
ticipates in the program. 

Trend-Stationary Process: A process that is stationary once 
a time trend has been removed; it is usually implicit that the 
detrended series is weakly dependent. 

True Model: The actual population model relating the depen- 
dent variable to the relevant independent variables, plus a 
disturbance, where the zero conditional mean assumption 
holds. 

Truncated Normal Regression Model: The special case 
of the truncated regression model where the underly- 
ing population model satisfies the classical linear model 
assumptions. 

Truncated Regression Model: A linear regression model for 
cross-sectional data in which the sampling scheme entirely 
excludes, on the basis of outcomes on the dependent vari- 
able, part of the population. 

Truncation Lag: A parameter in HAC standard errors that 
determines the number of lags of the residuals that need to 
be included to correct for serial correlation. 

Two-Sided Alternative: An alternative where the population 
parameter can be either less than or greater than the value 
stated under the null hypothesis. 

Two Stage Least Squares (2SLS) Estimator: An instru- 
mental variables estimator where the IV for an endog- 
enous explanatory variable is obtained as the fitted value 
from regressing the endogenous explanatory variable on all 
exogenous variables. 

Two-Tailed Test: A test against a two-sided alternative. 

Type I Error: A rejection of the null hypothesis when it is 
true. 

Type II Error: The failure to reject the null hypothesis when
it is false.


U 


Unbalanced Panel: A panel data set where certain years (or 
periods) of data are missing for some cross-sectional units. 


Unbiased Estimator: An estimator whose expected value (or 
mean of its sampling distribution) equals the population 
value (regardless of the population value). 

Uncentered R-Squared: The R-squared computed without
subtracting the sample average of the dependent variable
when obtaining the total sum of squares (SST).

Unconditional Forecast: A forecast that does not rely on
knowing, or assuming values for, future explanatory
variables.

Uncorrelated Random Variables: Random variables that are 
not linearly related. 

Underspecifying a Model: See excluding a relevant variable.

Unidentified Equation: An equation with one or more
endogenous explanatory variables where sufficient instrumental
variables do not exist to identify the parameters.

Unconfounded Assignment: See conditional independence.

Unit Root Process: A highly persistent time series process
where the current value equals last period’s value, plus a
weakly dependent disturbance.

Unobserved Effect: In a panel data model, an unobserved 
variable in the error term that does not change over time. 
For cluster samples, an unobserved variable that is com- 
mon to all units in the cluster. 

Unit Roots: In the time series process $y_t = \alpha + \rho y_{t-1} + u_t$,
we say that $y_t$ has a unit root if $\rho = 1$. This is also known as
a random walk and is an unpredictable process.

Unobserved Effects Model: A model for panel data or clus- 
ter samples where the error term contains an unobserved 
effect. 

Unobserved Heterogeneity: See unobserved effect.

Unrestricted Model: In hypothesis testing, the model that
has no restrictions placed on its parameters.

Upward Bias: The expected value of an estimator is greater
than the population parameter value.


V


Variance-Covariance Matrix: For a random vector, the posi- 
tive semi-definite matrix defined by putting the variances 
down the diagonal and the covariances in the appropriate 
off-diagonal entries. 

Variance-Covariance Matrix of the OLS Estimator: The 
matrix of sampling variances and covariances for the vector 
of OLS coefficients. 

Variance Inflation Factor: In multiple regression analysis
under the Gauss-Markov assumptions, the term in the sampling
variance affected by correlation among the explanatory
variables.

Variance of the Prediction Error: The variance in the error
that arises when predicting a future value of the dependent
variable based on an estimated multiple regression
equation.

Vector Autoregressive (VAR) Model: A model for two or 
more time series where each variable is modeled as a linear 
function of past values of all variables, plus disturbances 
that have zero means given all past values of the observed 
variables. 


W 


Wald Statistic: A general test statistic for testing hypotheses 
in a variety of econometric settings; typically, the Wald sta- 
tistic has an asymptotic chi-square distribution. 

Weak Instruments: Instrumental variables that are only 
slightly correlated with the relevant endogenous explana- 
tory variable or variables. 

Weakly Dependent: A term that describes a time series process 
where some measure of dependence between random vari- 
ables at two points in time—such as correlation—diminishes 
as the interval between the two points in time increases. 

Weighted Least Squares: See weighted least squares (WLS) estimator.

Weighted Least Squares (WLS) Estimator: An estima- 
tor used to adjust for a known form of heteroskedasticity, 
where each squared residual is weighted by the inverse of 
the (estimated) variance of the error. 

White Test for Heteroskedasticity: A test for heteroskedas- 
ticity in which the squared residuals are regressed on linear 
and non-linear functions of the explanatory variables. 

Within Estimator: See fixed effects estimator. 

Within Transformation: See fixed effects transformation. 


Y 


Year Dummy Variables: For data sets with a time series 
component, dummy (binary) variables equal to one in the 
relevant year and zero in all other years. 


Z 


Zero Conditional Mean Assumption: A key assumption 
used in multiple regression analysis that states that, given 
any values of the explanatory variables, the expected value 
of the error equals zero. (See Assumptions MLR.4, TS.3, 
and TS.3’ in the text.) 

Zero Matrix: A matrix where all entries are zero. 

Zero-One Variable: See dummy variable. 


Numbers 


2SLS. See two stage least squares 
401(k) plans 
asymptotic normality, 169-170 
comparison of simple and multiple regression estimates, 76 
statistical vs. practical significance, 133 
WLS estimation, 277 


A 


ability and wage 
causality, 12 
excluding ability from model, 84-89
IV for ability, 515 
mean independent, 23 
proxy variable for ability, 299-306 
adaptive expectations, 375, 377 
adjusted R-squareds, 196-199, 396 
advantages of multiple over simple regression, 66-70 
AFDC participation, 249 
age 
financial wealth and, 276-278, 282 
smoking and, 280-281 
aggregate consumption function, 547, 548 
air pollution and housing prices 
beta coefficients, 190-191 
logarithmic forms, 186-188 
quadratic functions, 190-192 
t test, 130 
alcohol drinking, 246 
alternative hypotheses 
defined, 734 
one-sided, 122-126, 735 
two-sided, 126-127, 735 
antidumping filings and chemical imports 
AR(3) serial correlation, 407 
dummy variables, 349-350 
forecasting, 632, 633 
PW estimation, 410 
seasonality, 358-360 
apples, ecolabeled, 195-196 
ARCH model, 417-418 
AR(2) models 
EMH example, 374 
forecasting example, 374 
AR(1) models, consistency example, 372-373 
testing for, after 2SLS estimation, 520 
arrests 
asymptotic normality, 169-170
average sentence length and, 268
goodness-of-fit, 78 
heteroskedasticity-robust LM statistic, 268 
linear probability model, 243 
normality assumption and, 119 
Poisson regression, 581-582 
AR(1) serial correlation 
correcting for, 407-414 
testing for, 402-407 
AR(q) serial correlation 
correcting for, 413-414 
testing for, 406-407 
ASCII files, 646 
assumptions 
classical linear model (CLM), 118 
establishing unbiasedness of OLS, 79-83, 339-342 
homoskedasticity, 45-48, 88-89, 95, 385 
matrix notation, 763-766 
for multiple linear regressions, 79-83, 88, 95, 166 
normality, 117-120
for simple linear regressions, 40-48 
for time series regressions, 339-345, 370-376, 385 
zero mean and zero correlation, 166 
asymptotically uncorrelated sequences, 346-348, 368-370 
asymptotic bias, deriving, 167-168 
asymptotic confidence interval, 171 
asymptotic efficiency of OLS, 175-176 
asymptotic normality of estimators, in general, 723-724 
asymptotic normality of OLS 
for multiple linear regressions, 170-172 
for time series regressions, 373-376 
asymptotic properties. See large sample properties 
asymptotic sample properties of estimators, 721-724 
asymptotics, OLS. See OLS asymptotics 
asymptotic standard errors, 171 
asymptotic t statistics, 171
asymptotic variance, 170 
attenuation bias, 311, 312 
attrition, 469 
augmented Dickey-Fuller test, 612 
autocorrelation, 342-344. See also serial correlation 
autoregressive conditional heteroskedasticity (ARCH)
model, 417-418 
autoregressive model of order two [AR(2)]. See AR(2) models 
autoregressive process of order one [AR(1)], 369 
auxiliary regression, 173 
average marginal effect (AME), 306, 566 
average partial effect (APE), 306, 566, 575 
average treatment effect (ATE), 53, 435 
average, using summation operator, 667 


B


balanced panel, 447
baseball players’ salaries 
nonnested models, 198 
testing exclusion restrictions, 139-144 
base group, 223 
base period 
and value, 348 
base value, 348 
beer 
price and demand, 200-201 
taxes and traffic fatalities, 199 
benchmark group, 223 
Bernoulli random variables, 685-686
best linear unbiased estimator (BLUE), 95 
beta coefficients, 184-185 
between estimators, 463 
bias 
attenuation, 311, 312 
heterogeneity, 440 
omitted variable, 84-89 
simultaneity, in OLS, 538-539 
biased estimators, 717-718
biased toward zero, 86 
binary explanatory variable, 51-56 
binary random variable, 685 
binary response models, 560. See logit and probit models 
binary variables, 51. See also qualitative information 
defined, 221 
random, 685-686
binomial distribution, 690 
birth weight 
AFDC participation, 249 
asymptotic standard error, 172 
data scaling, 181-183 
F statistic, 145-146 
IV estimation, 504 
bivariate linear regression model. See simple regression model 
Breusch-Godfrey test, 406 
Breusch-Pagan test, 473 
for heteroskedasticity, 270 


C


calculus, differential, 678-680
campus crimes, t test, 128-129
causal effect, 53 
causality, 10-14 
censored regression models, 583-586 
Center for Research in Security Prices (CRSP), 645 
central limit theorem, 724 
CEO salaries 
in multiple regressions 
motivation for multiple regression, 69-70 
nonnested models, 198-199
predicting, 207-209 
writing in population form, 80 
returns on equity and 
fitted values and residuals, 32 
goodness-of-fit, 35 
OLS Estimates, 29-30 
sales and, constant elasticity model, 39 
ceteris paribus, 10-14, 72-73 
multiple regression, 99-100




chemical firms, nonnested models, 198 
chemical imports. See antidumping filings and chemical imports 
chi-square distribution 
critical values table, 790 
discussions, 708, 757 
Chow statistic, 238
Chow tests 
differences across groups, 238 
heteroskedasticity and, 267 
for panel data, 450-451 
for structural change across time, 431 
cigarettes. See smoking 
city crimes. See also crimes 
law enforcement and, 13 
panel data, 9-10 
classical errors-in-variables (CEV), 311 
classical linear model (CLM) assumptions, 118 
clear-up rate, distributed lag estimation, 443—444 
clusters, 481—482 
effect, 481 
sample, 481 
Cochrane-Orcutt (CO) estimation, 410 
coefficient of determination, 35. See R-squareds 
cointegration, 616—620 
college admission, omitting unobservables, 305 
college GPA 
beta coefficients, 184-185 
collinearity, perfect, 80-82 
fitted values and intercept, 74 
gender and, 237-239 
goodness-of-fit, 77 
heteroskedasticity-robust F statistic, 266-267 
interaction effect, 193-194 
interpreting equations, 72 
with measurement error, 312 
partial effect, 73 
population regression function, 23 
predicted, 202-204 
with single dummy variable, 225 
t test, 127 
college proximity, as IV for education, 507—508 
colleges, junior vs. four-year, 136-138 
column vectors, 750 
commute time and freeway width, 742-743 
compact discs, demand for, 732 
complete cases estimator, 314 
complete cases indicator, 477 
composite error, 440 
term, 470 
Compustat, 645 
computer ownership 
college GPA and, 225 
determinants of, 286 
computers, grants to buy 
reducing error variance, 200-201 
R-squared size, 195-196 
computer usage and wages 
with interacting terms, 233 
proxy variable in, 302-303 
conceptual framework, 652 
conditional distributions 
features, 691—697 
overview, 688, 690-692 
conditional expectations, 700-704 
conditional forecasts, 623 
conditional independence, 100
conditional median, 321—323 
conditional variances, 704 
confidence intervals 
95%, rule of thumb for, 731 
asymptotic, 171 
asymptotic, for nonnormal populations, 732-733 
hypothesis testing and, 741-742 
interval estimation and, 727—733 
main discussions, 134—135, 727-728 
for mean from normally distributed population, 729-731 
for predictions, 201-203 
consistency of estimators, in general, 721-723 
consistency of OLS 
in multiple regressions, 164-168 
sampling selection and, 588-589 
in time series regressions, 370-373, 395 
consistent tests, 743 
constant dollars, 348 
constant elasticity model, 38, 81, 676 
constant term, 21 
consumer price index (CPI), 345 
consumption. See under family income 
contemporaneously exogenous variables, 340 
continuous random variables, 687—688 
control group, 53, 225 
control variable, 21. See also independent variables 
corner solution response, 560 
corrected R-squareds, 196-199 
correlated random effects, 474—477 
correlation, 22—23 
coefficients, 698-699 
counterfactual reasoning, 10-14 
count variables, 578, 579 
county crimes, multi-year panel data, 449-450 
covariances, 697—698 
stationary processes, 367-368 
covariates, 246 
crimes. See also arrests 
on campuses, t test, 128-129
in cities, law enforcement and, 13 
in cities, panel data, 9-10 
clear-up rate, 443-444 
in counties, multi-year panel data, 449-450 
earlier data, use of, 303-304 
econometric model of, 4—5 
economic model of, 3, 174, 295-297 
functional form misspecification, 295-297 
housing prices and, beta coefficients, 190-191 
LM statistic, 174 
prison population and, SEM, 551 
unemployment and, two-period panel data, 439-444 
criminologists, 644 
critical values 
discussions, 122, 735 
tables of, 786-790 
crop yields and fertilizers 
causality, 11, 12 
simple equation, 21-22 
cross-sectional analysis, 649 


cross-sectional data. See also panel data; pooled cross sections;
regression analysis
Gauss-Markov assumptions and, 88, 376 
main discussion, 5—7 
time series data vs., 334-335 


cumulative areas under standard normal distribution, 784-785 
cumulative distribution functions (cdf), 687-688 

cumulative effect, 338 

current dollars, 348 

cyclical unemployment, 375 


data 
collection, 645-648 
economic, types of, 5-12 
experimental vs. nonexperimental, 2 
frequency, 7 
data issues. See also misspecification 
measurement error, 308-313 
missing data, 313-315 
multicollinearity, 89-92, 313 
nonrandom samples, 315-316 
outliers and influential observations, 317—321 
random slopes, 306-307 
unobserved explanatory variables, 299-306 
data mining, 650 
data scaling, effects on OLS statistics, 181-185 
Davidson-MacKinnon test, 298, 299 
deficits. See interest rates 
degrees of freedom (df) 
chi-square distributions with n, 708 
for fixed effects estimator, 464 
for OLS estimators, 94 
dependent variables. See also regression analysis; specific event 
studies 
defined, 21 
measurement error in, 310-313 
derivatives, 673 
descriptive statistics, 667 
deseasonalizing data, 359 
detrending, 356-357 
diagonal matrices, 750 
Dickey-Fuller distribution, 611 
Dickey-Fuller (DF) test, 611-614 
augmented, 612 
difference-in-differences estimator, 432, 437 
difference in slopes, 233-236 
difference-stationary processes, 380 
differencing 
panel data 
with more than two periods, 447—451 
two-period, 439-444 
serial correlation and, 414-415 
differential calculus, 678—680 
diminishing marginal effects, 673 
discrete random variables, 685—686 
disturbance terms, 4, 21, 69 
disturbance variances, 45 
downward bias, 86 
drug usage, 246 
drunk driving laws and fatalities, 446 


dummy variables, 51. See also qualitative information; year dummy
variables
defined, 221 
regression, 466-467 
trap, 223 
duration analysis, 584-586 
Durbin-Watson test, 403-404 
dynamically complete models, 382-385 


earnings of veterans, IV estimation, 503 
EconLit, 643, 644 
econometric analysis in projects, 648-651 
econometric models, 4-5. See also economic models
econometrics, 1-2. See also specific topics 
economic growth and government policies, 7 
economic models, 2—5 
economic significance. See practical significance 
economic vs. statistical significance, 132-136, 
742-743 
economists, types of, 643, 644 
education 
birth weight and, 145-146 
fertility and 
2SLS, 521 
with discrete dependent variables, 249-250 
independent cross sections, 428-429 
gender wage gap and, 429-430 
IV for, 498, 507-508 
logarithmic equation, 677 
return to 
2SLS, 511 
differencing, 480 
fixed effects estimation, 466 
independent cross sections, 429-430 
IQ and, 301-302 
IV estimation, 501
over time, 429—430 
smoking and, 280-281 
testing for endogeneity, 516 
testing overidentifying restrictions, 518 
wages and (See under wages) 
women and, 239-241 (See also under women 
in labor force) 
efficiency 
asymptotic, 175-176 
of estimators in general, 719—720 
of OLS with serially correlated errors, 395-396 
efficient markets hypothesis (EMH) 
asymptotic analysis example, 374-375 
heteroskedasticity and, 416-417 
elasticity, 39, 676-677 
elections. See voting outcomes 
EMH. See efficient markets hypothesis (EMH) 
empirical analysis, 651 
data collection, 645-648 
econometric analysis, 648-651 
literature review, 644-645 
posing question, 642-644 
sample projects, 658-663 
steps in, 2-5 
writing paper, 651-658 
employment and unemployment. See also wages 
arrests and, 243 
crimes and, 439-444 
enterprise zones and, 449 
estimating average rate, 716 
forecasting, 625, 628, 630 
inflation and (See under inflation) 
in Puerto Rico 
logarithmic form, 345-346 
time series data, 7 
women and. (See women in labor force) 
endogenous explanatory variables, 495. See also instrumental
variables; simultaneous equations models; two stage least 
squares 
defined, 82, 294 
in logit and probit models, 571 
sample selection and, 592 
testing for, 515-516
endogenous sample selection, 315 
endogenous variables, 536 
Engle-Granger test, 617, 618 
Engle-Granger two-step procedure, 622 
enrollment, t test, 128-129
enterprise zones 
business investments and, 736-737 
unemployment and, 449 
error correction models, 620-622 
errors-in-variables problem, 495, 514-515 
error terms, 4, 21, 69 
error variances 
adding regressors to reduce, 200-201 
defined, 45, 89 
estimating, 48-50 
estimated GLS. See feasible GLS 
estimation and estimators. See also first differencing; fixed effects; 
instrumental variables; logit and probit models; ordinary least 
squares (OLS); random effects; Tobit model 
asymptotic sample properties of, 721-724 
changing independent variables simultaneously, 74 
defined, 715 
difference-in-difference-in-differences, 437 
difference-in-differences, 432, 434 
finite sample properties of, 715-720 
language, 96-97 
method of moments approach, 25—26 
misspecifying models, 84—89 
sampling distributions of OLS estimators, 117—120 
event studies, 347, 349-350 
Excel, 647 
excluding relevant variables, 84—89 
exclusion restrictions, 139 
for 2SLS, 509 
general linear, 148-149 
Lagrange multiplier (LM) statistic, 172-174 
overall significance of regressions, 147 
for SEM, 545, 546 
testing, 139-144 
exogenous explanatory variables, 82, 507 
exogenous sample selection, 315, 589 
exogenous variables, 536 
expectations augmented Phillips curve, 375-376, 403, 404 
expectations hypothesis, 14 
expected values, 691—693, 756 
experience 
wage and 
causality, 12 
interpreting equations, 73 
motivation for multiple regression, 67 
omitted variable bias, 87 
partial effect, 679 
quadratic functions, 188—190, 674 
women and, 239-241 
experimental data, 2 
experimental group, 225 
experiments, defined, 684 
explained sum of squares (SSE), 34, 70, 76-77 
explained variables, 21. See also dependent variables
explanatory variables, 21. See also independent variables 
exponential function, 677 

exponential smoothing, 623 

exponential trend, 352-353 


F 


falsification test, 479 
family income. See also savings 
birth weight and 
asymptotic standard error, 172 
data scaling, 181-183 
college GPA, 312 
consumption and 
motivation for multiple regression, 68, 69 
perfect collinearity and, 81 
farmers and pesticide usage, 200 
F distribution 
critical values table, 787-789 
discussions, 709, 710, 757 
feasible GLS 


with heteroskedasticity and AR(1) serial correlations, 419 


main discussion, 277—282 
OLS vs., 411-413 
Federal Bureau of Investigation, 645 
fertility rate 
education and, 521 
FDL model, 336-338 
forecasting, 634 
over time, 428—429 
tax exemption and 
with binary variables, 346-347 
cointegration, 618-619 
first differences, 385-386 
serial correlation, 384 
trends, 355 
fertility studies, with discrete dependent variable, 249-250 
fertilizers 
land quality and, 23 
soybean yields and 
causality, 11, 12 
simple equation, 21-22 
final exam scores 
interaction effect, 193-194 
skipping classes and, 498—499 
financial wealth 
nonrandom sampling, 315-316 
and WLS estimation, 276-278, 282 
finite distributed lag (FDL) models, 336-338, 372, 
443-444 
finite sample properties 
of estimators, 715—720 
of OLS in matrix form, 763—766 
firm sales. See sales 
first-differenced equations, 441 
first-differenced estimator, 441 
first differencing 
defined, 441 
fixed effects vs., 467-469 
I(1) time series and, 380 
panel data, pitfalls in, 451 
first order autocorrelation, 381 
first order conditions, 27, 71, 680, 762 


fitted values. See also ordinary least squares (OLS)
in multiple regressions, 74-75 
in simple regressions, 27, 32 
fixed effects 
defined, 439 
dummy variable regression, 466—467 
estimation, 463—469 
first differencing vs., 467-469 
random effects vs., 473-474 
transformation, 463 
with unbalanced panels, 468-469 
fixed effects model, 440 
forecast error, 622 
forecasting 
multiple-step-ahead, 628-630 
one-step-ahead, 622, 624-627 
overview and definitions, 622—623 
trending, seasonal, and integrated processes, 
631-634 
types of models used for, 623-624 
forecast intervals, 624 
free throw shooting, 690-691 
freeway width and commute time, 742-743 
frequency, data, 7 
frequency distributions, 401(k) plans, 169 
Frisch-Waugh theorem, 75 
F statistics. See also F tests 
defined, 141 
heteroskedasticity-robust, 266-267 
F tests. See also Chow tests; F statistics 
F and t statistics, 144-145 
functional form misspecification and, 295-299 
general linear restrictions, 148-149 
LM tests and, 174 
p-values for, 146-147 
reporting regression results, 149-150 
R-squared form, 145-146 
testing exclusion restrictions, 139-144 
functional forms 
in multiple regressions 
with interaction terms, 192—194 
logarithmic, 186-188 
misspecification, 295-299 
quadratic, 188-192 
in simple regressions, 36—40 
in time series regressions, 345-346 
Gaussian distribution, 704
Gauss-Markov assumptions 
cross-sectional data, 88 
for multiple linear regressions, 79-83, 95-96 
for simple linear regressions, 40-48 
for time series regressions, 342-344 
Gauss-Markov Theorem 
for multiple linear regressions, 95-96 
for OLS in matrix form, 765-766 
oversampling, 316
wage gap, 429-430 
gender gap 
independent cross sections, 429-430 
panel data, 429-430 


generalized least squares (GLS) estimators 
for AR(1) models, 409-414 
with heteroskedasticity and AR(1) serial correlations, 419 
when heteroskedasticity function must be estimated, 278-283 
when heteroskedasticity is known up to a multiplicative constant, 
274-275 
generalized least squares procedures, 400 
geometric distributed lag (GDL), 607-608 
GLS estimators. See generalized least squares (GLS) estimators 
Goldberger, Arthur, 91 
goodness-of-fit. See also predictions; R-squareds 
change in unit of measurement and, 37 
in multiple regressions, 76-77 
overemphasizing, 199-200 
percent correctly predicted, 242, 565 
in simple regressions, 35-36 
in time series regressions, 396 
Google Scholar, 643 
government policies 
economic growth and, 6, 8-9 
GPA. See college GPA 
Granger causality, 626 
Granger, Clive W. J., 164 
gross domestic product (GDP) 
data frequency for, 7 
government policies and, 6 
high persistence, 377-379 
in real terms, 348 
seasonal adjustment of, 358 
unit root test, 614 
group-specific linear time trends, 438 
growth rate, 353 
gun control laws, 246 


HAC standard errors, 399 
Hartford School District, 205 
Hausman test, 473, 474 
Hausman test, 281 
Head Start participation, 245 
Heckit method, 591 
heterogeneity, 466 
heterogeneity bias, 440 
heterogeneous trend model, 479 
heteroskedasticity. See also weighted least squares estimation 
2SLS with, 518-519 
consequences of, for OLS, 262-263 
defined, 45 
HAC standard errors, 399 
heteroskedasticity-robust procedures, 263—268 
linear probability model and, 284-286 
robust F statistic, 266 
robust LM statistic, 267
robust t statistic, 265
for simple linear regressions, 45-48 
testing for, 269-273 
for time series regressions, 385 
in time series regressions, 415-419 
of unknown form, 263 
heteroskedasticity and autocorrelation consistent (HAC) standard 
errors, 399 
highly persistent time series 
deciding whether I(0) or I(1), 381-382 
description of, 376-385
transformations on, 380-382 
histogram, 401(k) plan participation, 169 
homoskedasticity 
IV estimation, 500, 501 
for multiple linear regressions, 88-89, 95 
for OLS in matrix form, 764 
for time series regressions, 342-344, 373-374 
in wage equation, 46 
hourly wages. See wages 
housing prices and expenditures 
general linear restrictions, 148-149 
heteroskedasticity 
BP test, 270-271 
White test, 271-273 
incinerators and 
inconsistency in OLS, 167 
pooled cross sections, 431—434 
income and, 669 
inflation, 609-610 
investment and 
computing R-squared, 356-357 
spurious relationship, 354-355 
over controlling, 200 
with qualitative information, 226-227 
RESET, 297-298 
savings and, 537-538 
hypotheses. See also hypothesis testing 
about single linear combination of parameters, 136-139 
after 2SLS estimation, 513 
expectations, 14 
language of classical testing, 132 
in logit and probit models, 564-565 
multiple linear restrictions (See F tests) 
residual analysis, 205 
stating, in empirical analysis, 4 
hypothesis testing 
about mean in normal population, 735-736 
asymptotic tests for nonnormal populations, 738 
computing and using p-values, 738-740 
confidence intervals and, 741-742 
in matrix form, Wald statistics for, 771 
overview and fundamentals, 733-735 
practical vs. statistical significance, 742-743 
idempotent matrices, 755
defined, 499
in systems with three or more equations, 545-546 
in systems with two equations, 540-543 
identified equation, 540 
identity matrices, 750 
idiosyncratic error, 440 
incinerators and housing prices
inconsistency in OLS, 167 
pooled cross sections, 431-434 
including irrelevant variables, 83-84 
income. See also wages 
housing expenditure and, 669 
PIH, 548-549 
savings and (See under savings) 
inconsistency in OLS, deriving, 167-168 
inconsistent estimators, 721 
independence, joint distributions and, 688—690 
independently pooled cross sections. See also pooled cross sections 
across time, 427—431 
defined, 426 
independent variables. See also regression analysis; specific event 
studies 
changing simultaneously, 74 
defined, 21 
measurement error in, 310-313 
in misspecified models, 84-89 
random, 689 
simple vs. multiple regression, 67—70 
index numbers, 348-349 
index of industrial production (IIP), 348
indicator function, 561 
infant mortality rates, outliers, 320-321 
inference 
in multiple regressions 
confidence intervals, 134—136 
of OLS with serially correlated errors, 395-396 
statistical, with IV estimator, 500-503 
in time series regressions, 344-345 
infinite distributed lag models (IDL), 605-610 
inflation 
from 1948 to 2003, 335 
examples of models, 335-338 
openness and, 543-545 
random walk model for, 377 
unemployment and 
expectations augmented Phillips curve, 375-376 
forecasting, 625 
static Phillips curve, 336, 344-345 
unit root test, 613 
influential observations, 317—321 
information set, 622 
in-sample criteria, 627 
instrumental variables 
computing R-squared after IV estimation, 505 
in multiple regressions, 505-509 
overview and definitions, 496, 497, 499 
properties, with poor instrumental variable, 503-505 
in simple regressions, 496-505 
solutions to errors-in-variables problems, 514-515 
statistical inference, 500-503 
integrated of order zero/one processes, 380-382 
integrated processes, forecasting, 631-634 
interaction effect, 192-194 
interaction terms, 232—233 
intercept parameter, 21 
intercepts. See also OLS estimators; regression analysis 
change in unit of measurement and, 36-37 
defined, 21, 668 
in regressions on a constant, 51 
in regressions through origin, 50-51 
intercept shifts, 222 
interest rates 
differencing, 415 
inference under CLM assumptions, 345 
T-bill (See T-bill rates) 
interval estimation, 714, 727-728
inverse Mills ratio, 573 
inverse of matrix, 753 
IQ 
ability and, 301-302, 304-305 
nonrandom sampling, 315-316 
IV. See instrumental variables
sample model
as self-selection problem, 3 
joint distributions
features of, 691-697 
independence and, 688—690 
joint hypotheses tests, 139 
jointly statistically significant/insignificant, 142 
joint probability, 688 
Koyck distributed lag, 607-608
kurtosis, 697
labor supply and demand, 535-536
labor supply function, 677 
lag distribution, 337 
lagged dependent variables 
as proxy variables, 303-304 
serial correlation and, 396-398 
lagged endogenous variables, 547 
lagged explanatory variables, 338 
Lagrange multiplier (LM) statistics 
heteroskedasticity-robust, 267-268 (See also heteroskedasticity) 
main discussion, 172—174 
land quality and fertilizers, 23 
large sample properties, 721-723 
latent variable models, 561 
law enforcement 
city crime levels and (causality), 13 
murder rates and (SEM), 537 
law of iterated expectations, 703 
law of large numbers, 722 
law school rankings 
as dummy variables, 232 
residual analysis, 205 
leads and lags estimator, 620 
least absolute deviations (LAD) estimation, 321-323 
least squares estimator, 726 
likelihood ratio statistic, 564 
likelihood ratio (LR) test, 564 


limited dependent variables 
corner solution response (See Tobit model) 
limited dependent variables (LDV) 
censored and truncated regression models, 582-587 
count response, Poisson regression for, 578-582 
overview, 559-560 
sample selection corrections, 588-593 
linear functions, 668-669 
linear independence, 754 
linear in parameters assumption 
for OLS in matrix form, 763 
for simple linear regressions, 40, 44 
for time series regressions, 339-340 
linearity and weak dependence assumption, 370-371 
linear probability model (LPM). See also limited 
dependent variables 
heteroskedasticity and, 284—286 
main discussion, 239-244 
linear regression model, 40, 70 
linear relationship among independent variables, 89-92 
linear time trend, 351-352 
literature review, 644-645 
loan approval rates 
F and t statistics, 164
multicollinearity, 91 
program evaluation, 245 
logarithms 
in multiple regressions, 186-188 
natural, overview, 777-780 
predicting y when log(y) is dependent, 206-208 
qualitative information and, 226-228 
real dollars and, 349 
in simple regressions, 37—39 
in time series regressions, 345-346 
log function, 674 
logit and probit models 
interpreting estimates, 565-571 
maximum likelihood estimation of, 563-564 
specifying, 560-563 
testing multiple hypotheses, 564-565 
log-likelihood function, 564 
longitudinal data. See panel data 
long-run elasticity, 346 
long-run multiplier. See long-run propensity (LRP) 
long-run propensity (LRP), 338 
loss functions, 622 
lunch program and math performance, 44—45 


macroeconomists, 643 
marginal effect, 668 
marital status. See qualitative information 
martingale difference sequence, 610 
martingale functions, 623 
matched pairs samples, 481 
mathematical statistics. See statistics 
math performance and lunch program, 44—45 
matrices. See also OLS in matrix form 
addition, 750 
basic definitions, 749-750 
differentiation of linear and quadratic forms, 755 
idempotent, 755 
linear independence and rank of, 754 
moments and distributions of random vectors, 756-757 
multiplication, 751-752
operations, 750-753 
quadratic forms and positive definite, 754-755 
matrix notation, 762 
maximum likelihood estimation (MLE), 563-564, 
725-726 
with explanatory variables, 602 
mean absolute error (MAE), 628 
mean independent, 23 
mean squared error (MSE), 720 
mean, using summation operator, 667—668 
measurement error
IV solutions to, 514-515
properties of OLS under, 308-313
measures of association, 697
measures of central tendency, 694-696
measures of variability, 695
median, 668, 694
men, return to education, 502
method of moments approach, 25-26, 725 
micronumerosity, 91 
military personnel survey, oversampling in, 316 
minimum variance unbiased estimators, 118, 726, 768 
minimum wages 
causality, 13 
employment/unemployment and 
AR(1) serial correlation, testing for, 405 
detrending, 356-357 
logarithmic form, 345-346 
SC-robust standard error, 400 
in Puerto Rico, effects of, 7-8 
minorities and loans. See loan approval rates 
missing at random, 315 
missing completely at random (MCAR), 314 
missing data, 313-315 
misspecification 
in empirical analysis, 650 
functional forms, 295—299 
unbiasedness and, 84-89 
variances, 92—93 
motherhood, teenage, 480 
moving average process of order one [MA(1)], 368 
multicollinearity, 313 
2SLS and, 511 
main discussion, 89-92 
multiple hypotheses tests, 139 
multiple linear regression (MLR) model, 69 
multiple regression analysis. See also data issues; estimation and 
estimators; heteroskedasticity; hypotheses; ordinary least 
squares (OLS); predictions; R-squareds 
adding regressors to reduce error variance, 200-201 
advantages over simple regression, 66-70 
causal effects and policy analysis, 151-152 
ceteris paribus, 99-100 
confidence intervals, 134-136 
efficient markets, 98—99 
interpreting equations, 73 
null hypothesis, 120 
omitted variable bias, 84—89 
over controlling, 199-200 
policy analysis, 100 
potential outcomes, 100 
prediction, 98 
trades off variable, 99 
treatment effect, 100 
multiple regressions. See also qualitative information
beta coefficients, 184 
hypotheses with more than one parameter, 136—139 
misspecified functional forms, 295 
motivation for multiple regression, 67 
nonrandom sampling, 315-316 
normality assumption, 119 
productivity and, 382 
quadratic functions, 188—192 
with qualitative information 
of baseball players, race and, 235-236 
computer usage and, 233 
with different slopes, 233-236 
education and, 233-235 
gender and, 222-228, 233-235 
with interacting terms, 232 
law school rankings and, 232 
with log(y) dependent variable, 226-228 
marital status and, 232—233 
with multiple dummy variables, 228-232 
with ordinal variables, 230-231 
physical attractiveness and, 231 
random effects model, 472 
random slope model, 305-306 
reporting results, 149-150 
t test, 122 
with unobservables, general approach, 304-305 
with unobservables, using proxy, 299-306 
working individuals in 1976, 6 
multiple restrictions, 139 
multiple-step-ahead forecasts, 623, 628-630 
multiplicative measurement error, 309 
multivariate normal distribution, 756-757 
municipal bond interest rates, 230-231 
murder rates 
SEM, 537 
static Phillips curve, 336 


natural experiments, 434, 503 
natural logarithms, 777-780. See also logarithms 
netted out, 75 
Newey-West standard errors, 400, 407—408 
nominal dollars, 348 
nominal vs. real, 348 
nonexperimental data, 2 
nonlinear functions, 672-678 
nonlinearities, incorporating in simple regressions, 37—39 
nonnested models 

choosing between, 197—199 

functional form misspecification and, 298-299 
nonrandom samples, 315-316, 588 
nonstationary time series processes, 367-368 
no perfect collinearity assumption 

for OLS in matrix form, 763

for multiple linear regressions, 80-83 

for time series regressions, 340, 371 
normal distribution, 704—708 
normality assumption 

for multiple linear regressions, 117—120 

for time series regressions, 344 
normality of errors assumption, 767 
normality of estimators in general, 

asymptotic, 723-724 


normality of OLS, asymptotic 
in multiple regressions, 168—174 
for time series regressions, 373-376 
normal sampling distributions 
for multiple linear regressions, 119-120 
for time series regressions, 344-345 
no serial correlation assumption. See also serial correlation 
for OLS in matrix form, 764-765 
for time series regressions, 342-344, 373-374 
n-R-squared statistic, 173 
null hypothesis, 120-122, 734. See also hypotheses 
numerator degrees of freedom, 141 
observational data, 2
OLS and Tobit estimates, 575-577 
OLS asymptotics 
in matrix form, 769-771 
in multiple regressions 
consistency, 164-168 
efficiency, 175-176 
overview, 163-164 
in time series regressions 
consistency, 370-376 
OLS estimators. See also heteroskedasticity 
defined, 40 
in multiple regressions 
efficiency of, 95-96 
variances of, 87-95 
sampling distributions of, 117—120 
in simple regressions 
expected value of, 79-87 
unbiasedness of, 83 
variances of, 45-50 
in time series regressions 
sampling distributions of, 344-345 
unbiasedness of, 339-345 
variances of, 342—344 
OLS in matrix form 
asymptotic analysis, 769-771 
finite sample properties, 763-766 
overview, 760-762 
statistical inference, 767—768 
Wald statistics for testing multiple hypotheses, 771 
OLS intercept estimates, defined, 71-72 
OLS regression line. See also ordinary least squares (OLS) 
defined, 28, 71 
OLS slope estimates, defined, 71 
omitted variable bias. See also instrumental variables 
general discussions, 84-89 
using proxy variables, 299-305 
omitted variables, 495 
one-sided alternatives, 735 
one-step-ahead forecasts, 622, 624-627 
one-tailed tests, 122, 736. See also t tests 
online databases, 646 
online search services, 644 
order condition, 513, 541 
ordinal variables, 230-231 
ordinary least squares (OLS) 
cointegration and, 619-620 


comparison of simple and multiple regression estimates, 75—76 


consistency (See consistency of OLS) 
logit and probit vs., 568-570 


in multiple regressions 
algebraic properties, 70-78 
computational properties, 70-78 
effects of data scaling, 181-185 
fitted values and residuals, 74 
goodness-of-fit, 76-77 
interpreting equations, 71-72 


Lagrange multiplier (LM) statistic, 172-174 


measurement error and, 308-313 
normality, 168-174 
partialled out, 75 
regression through origin, 79 
statistical properties, 79-87 
Newey-West standard errors, 407-408 
Poisson vs., 580-582 
with serially correlated errors, properties of, 
395-398 
in simple regressions 
algebraic properties, 32-34 
defined, 27 
deriving estimates, 24-32 
statistical properties, 45-50 
unbiasedness of, 40—45 
units of measurement, changing, 36-37 
simultaneity bias in, 538-539 
in time series regressions 
correcting for serial correlation, 409-413 
FGLS vs., 411-413 
finite sample properties, 339-345 
normality, 373-376 
SC-robust standard errors, 398-401 
Tobit vs., 575-577 
outliers 
guarding against, 321-323 
main discussion, 317—321 
out-of-sample criteria, 627 
overall significance of regressions, 147 
over controlling, 199-200 
overdispersion, 580 
overidentified equations, 546 
overidentifying restrictions, testing, 516-518 
overspecifying the model, 84 
pairwise uncorrelated random variables,
699-700 
panel data 
applying 2SLS to, 521-522 


applying methods to other structures, 480—483 


correlated random effects, 474—477 


differencing with more than two periods, 447—451 


fixed effects, 463—469 

independently pooled cross sections vs., 427 
organizing, 444 

overview, 9-10 

pitfalls in first differencing, 451 

policy analysis with, 477—479 

random effects, 469—474 


simultaneous equations models with, 549-551 


two-period, analysis, 444—446 
two-period, policy analysis with, 444—446 
unbalanced, 468—469 
Panel Study of Income Dynamics, 645 
parallel trends assumption, 436 
parameters
defined, 4, 714 
estimation, general approach to, 724-726 
partial derivatives, 679 
partial effect, 72-74 
partial effect at average (PEA), 566 
partialled out, 75 
partitioned matrix multiplication, 752-753 
percentage point change, 672 
percentages, 671—672 
change, 671 
percent correctly predicted, 242, 565 
perfect collinearity, 80-82 
permanent income hypothesis (PIH), 548-549 
pesticide usage, over controlling, 200 
physical attractiveness and wages, 231 
pizzas, expected revenue, 693 
plug-in solution 
to the omitted variables problem, 300 
point estimates, 714 
point forecasts, 624 
Poisson distribution, 579, 580 
Poisson regression model, 578—580, 582 
policy analysis 
with pooled cross sections, 431—439 
with qualitative information, 225, 244-249 
with two-period panel data, 444—446 
pooled cross sections. See also independently pooled cross sections 
applying 2SLS to, 521-522 
overview, 8 
policy analysis with, 431-439
pooled OLS (POLS) 
cluster samples, 482 
random effects vs., 473 
population, defined, 714 
population model, defined, 79 
population regression function (PRF), 23 
population R-squareds, 196 
positive definite and semi-definite matrices, defined, 755 
poverty rate 
in absence of suitable proxies, 305 
excluding from model, 86 
power of test, 734 
practical significance, 132 
practical vs. statistical significance, 132-136, 742-743 
Prais-Winsten (PW) estimation, 410-412
predetermined variables, 547 
predicted variables, 21. See also dependent variables 
prediction error, 203 
predictions 
confidence intervals for, 201-204
with heteroskedasticity, 283-284 
residual analysis, 205 
for y when log(y) is dependent, 206-208 
predictor variables, 21, 23. See also independent variables
price index, 348-349 
prisons 
population and crime rates, 551 
recidivism, 584-585
probability. See also conditional distributions; joint distributions
features of distributions, 691-697
joint, 688 
normal and related distributions, 704-708 
overview, 684 
random variables and their distributions, 684-688 
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probability density function (pdf), 686 
probability limits, 721-723 
probit model. See logit and probit models 
productivity. See worker productivity 
program evaluation, 225, 244-249 
projects. See empirical analysis 
property taxes and housing prices, 8
proxy variables, 299-306 
and potential outcomes, 305-306
pseudo R-squareds, 566
public finance study researchers, 643
Puerto Rico, employment in
detrending, 356-357
logarithmic form, 345-346
time series data, 7-8
p-values
computing and using, 738-740
for t tests, 130-132
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quadratic form for matrices, 754-756 
quadratic function, 672-674
quadratic time trends, 353
qualitative information. See also linear probability model (LPM)
in multiple regressions
allowing for different slopes, 233-236
binary dependent variable, 239-244
describing, 221-222
discrete dependent variables, 249-250
interactions among dummy variables, 232-233
with log(y) dependent variable, 226-228
multiple dummy independent variables, 228-232
ordinal variables, 230-231
overview, 220-221
policy analysis and program evaluation, 244-249
proxy variables, 302-303
single dummy independent variable, 222-228
testing for differences in regression functions across groups,
237-239
in time series regressions
seasonal, 358-360
quantile regression, 323
quasi-demeaned data, 470
quasi-differenced data, 409
quasi-experiment, 434
quasi-(natural) experiments, 434
quasi-likelihood ratio statistic, 581
quasi-maximum likelihood estimation (QMLE), 580, 768
R_j^2, 89-92
race 
arrests and, 244 
baseball player salaries and, 235-236 
discrimination in hiring 
asymptotic confidence interval, 732-733 
hypothesis testing, 738 
p-value, 741 
random assignment, 54 
random coefficient model, 305-306 
random effects 
correlated, 474-477 
estimator, 471
fixed effects vs., 473-474
main discussion, 469-474 
pooled OLS vs., 473 
randomized controlled trial (RCT), 54 
random sampling 
assumption 
for multiple linear regressions, 80 
for simple linear regressions, 40-42, 44 
cross-sectional data, 5-7
defined, 715 
random slope model, 305-306 
random trend model, 479 
random variables, 684-688
random vectors, 756 
random walks, 376 
rank condition, 513, 541-543 
rank of matrix, 754 
rational distributed lag models (RDL), 608-610 
R&D and sales 
confidence intervals, 135-136 
nonnested models, 197-199
outliers, 317-318
real dollars, 348 
recidivism, duration analysis, 584-586 
reduced form equations, 507, 539 
reduced form errors, 539 
reduced form parameters, 539 
regressand, 21. See also dependent variables 
regression adjustment, 246 
regression analysis, 50-51. See also multiple regression analysis; 
simple regression model; time series data 
regression on binary explanatory variable, 51-56 
regression specification error test (RESET), 297-298 
regression through origin, 50-52 
regressors, 21, 200-201. See also independent variables 
rejection region, 735 
rejection rule, 122. See also t tests 
relative change, 671 
relative efficiency, 719-720 
relevant variables, excluding, 84-89 
reporting multiple regression results, 149-150 
rescaling, 181-183 
residual analysis, 205 
residuals. See also ordinary least squares (OLS) 
in multiple regressions, 74, 318-319 
in simple regressions, 27, 32, 48 
studentized, 318 
residual sum of squares, 76 
residual sum of squares (SSR). See sum of squared residuals 
response probability, 240, 560 
response variable, 21. See also dependent variables 
restricted model, 140-141. See also F tests 
restricted regression adjustment (RRA), 247 
retrospective data, 2 
returns on equity and CEO salaries 
fitted values and residuals, 32 
OLS estimates, 29-30
in simple regressions, 35 
robust regression, 323 
rooms and housing prices 
beta coefficients, 190-191 
interaction effect, 192-194 
quadratic functions, 190-192 
residual analysis, 205 
root mean squared error (RMSE), 50, 94, 627-628
row vectors, 749 
R-squareds. See also predictions 
adjusted, 196-199, 396 
after IV estimation, 505 
change in unit of measurement and, 37 
in fixed effects estimation, 465, 466 
for F statistic, 145-146 
in multiple regressions, main discussion, 76-79 
for probit and logit models, 566 
for PW estimation, 410-411 
in regressions through origin, 50-51, 79 
in simple regressions, 35-36 
size of, 195-196 
in time series regressions, 396 
trending dependent variables and, 356-357 
uncentered, 230 
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salaries. See CEO salaries; income; wages 
sales 
CEO salaries and 
constant elasticity model, 39 
nonnested models, 198-199
motivation for multiple regression, 69-70 
R&D and (See R&D and sales) 
sales tax increase, 672 
sample average, 715 
sample correlation coefficient, 725 
sample covariance, 725 
sample regression function (SRF), 28, 71 
sample selection corrections, 588-593 
sample standard deviation, 723 
sample variation in the explanatory variable 
assumption, 42, 44 
sampling distributions 
defined, 716 
of OLS estimators, 117-120
sampling, nonrandom, 315-316 
sampling standard deviation, 733 
sampling variances 
of estimators in general, 718-719 
of OLS estimators 
for multiple linear regressions, 88, 89 
for simple linear regressions, 47-48 
sampling variances of OLS estimators 
for simple linear regressions, 47-48 
for time series regressions, 342-344 
savings 
housing expenditures and, 537-538 
income and 
heteroskedasticity, 273-275 
scatterplot, 25 
measurement error, 309 
with nonrandom samples, 315-316 
scalar multiplication, 750-751 
scalar variance-covariance matrices, 764 
scatterplots 
R&D and sales, 318 
savings and income, 25 
wage and education, 27 
school lunch program and math performance, 44-45
score statistic, 172-174 
scrap rates and job training 
2SLS, 521-522 
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confidence interval, 740-741 
confidence interval and hypothesis testing, 742 
fixed effects estimation, 464-465
measurement error in, 309-310 
program evaluation, 244 
p-value, 740-741 
statistical vs. practical significance, 133-134 
two-period panel data, 445 
unbalanced panel data, 469 
seasonal dummy variables, 359 
seasonality 
forecasting, 631-634 
serial correlation and, 407 
of time series, 358-360
seasonally adjusted patterns, 358 
selected samples, 588 
self-selection problems, 245 
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labor supply function, 677 
multiple regressions (See also qualitative information) 
homoskedasticity, 88-89 
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White test for heteroskedasticity, 271-273 
within estimators, 463. See also fixed effects 
within transformation, 463 
women in labor force 
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